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Abstract 



We study the basic problem of robust subspace recovery. That is, we assume a data set that some 
of its points are sampled around a fixed subspace and the rest of them are spread in the whole am- 
& ' bient space, and we aim to recover the fixed underlying subspace. We first estimate "robust inverse 

^ , sample covariance" by solving a convex minimization procedure; we then recover the subspace by 

the bottom eigenvectors of this matrix (their number correspond to the number of eigenvalues close 
to 0). We guarantee exact subspace recovery under some conditions on the underlying data. Fur- 
thermore, we propose a fast iterative algorithm, which linearly converges to the matrix minimizing 
C*~) . the convex problem. We also quantify the effect of noise and regularization and discuss many other 

1 practical and theoretical issues for improving the subspace recovery in various settings. When re- 

placing the sum of terms in the convex energy function (that we minimize) with the sum of squares 
of terms, we obtain that the new minimizer is a scaled version of the inverse sample covariance 
(ST (when exists). We thus interpret our minimizer and its subspace (spanned by its bottom eigenvec- 

tors) as robust versions of the empirical inverse covariance and the PCA subspace respectively. We 
compare our method with many other algorithms for robust PCA on synthetic and real data sets and 
demonstrate state-of-the-art speed and accuracy. 

Keywords: Principal components analysis, robust statistics, M-estimator, iteratively re-weighted 
' least squares, convex relaxation 

1. Introduction 

The most useful paradigm in data analysis and machine learning is arguably the modeling of data by 
a low-dimensional subspace. The well-known total least squares solves this modeling problem by 
finding the subspace minimizing the sum of squared errors of data points. This is practically done 
via principal components analysis (PCA) of the data matrix. Nevertheless, this procedure is highly 
sensitive to outliers. Many heuristics have been proposed for robust recovery of the underlying 
subspace. Recent progress in the rigorous study of sparsity and low-rank of data has resulted in 
provable convex algorithms for this purpose. Here, we propose a different rigorous and convex 
approach, which is a special M-estimator. 



R obustness of statistica l estimators has been carefully studied for several decades (IHuber and Ronchetti, 

2009 : Maronna et al. , 2006 ). A classical example is the robustness of the geometric median dLopuhaa and Rousseeuw , 
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199 lh . For a data set X = {xj 
function of y € ~S, D : 



N 



C 



, the geometric median is the minimizer of the following 



N 
i=l 



(1) 



where || • || denotes the Euclidean norm. This is a typical example of an M-estimator, that is, a 
minimizer of a function of the form Yli=i p( r i)< where r,- L is a residual of the ith data point, Xj, from 
the parametrized object we want to estimate. Here, r, = ||y — Xj||, p(x) = \x\ and we estimate 
y € M D , which is parametrized by its D coordinates. 

There are several obstacles in developing robust and effective estimators for subspaces. For 
simplicity, we discuss here estimators of linear subspaces and thus assume that the data is centered 
at the originQ A main obstacle is due to the fact that the set of ci-dimensional linear subspaces in 
M. D , i.e., the Grassmannian G(D,d), is not convex. Therefore, a direct optimization on G(D,d) 
(or a union of G(D, d) over different d's) will not be convex (even not geodesically convex) and 
may result in several (or many) local minima. Another problem is that extensions of simple robust 
estimators of vectors to subspaces (e.g., using Zi-type averages) can fail by a single far away outlier. 
For example, one may extend the d-dimensional geometric median minimizing (Q]) to the minimizer 
over L G G(D, d) of the function 



N 



N 



(2) 



1=1 



i=l 



where LA is the orthogonal com plement of L and Pt, and P T ± are the ortho gonal projections on 
L and L 1 - respectively (see e.g., Ding et al. . 20061 : Lerman and Zhang . 2010 ). However, a single 
outlier with arbitrarily large magnitude will enforce the minimizer of © to contain it. 

The first obstacle can be resolved by applying a convex relaxation of the minimization of © 
so that subspaces are mapped into a convex set of matrices (the objective fun ction may be adapted 
respectively). Indeed, the subspace recovery proposed by Xu et al. ( 2010bl) can be interpreted in 
this way. Their objective function has one component which is similar to (|2]), though translated 
to matrices. They avoid the second obstacle by introducing a second component, which penalizes 
inliers of large magnitude (so that outliers of large magnitude may not be easily identified as inliers). 
However, the combination of the two components involves a parameter that needs to be carefully 
estimated. 

Here, we suggest a different convex relaxation that does not introduce arbitrary parameters and 
its implementation is significantly faster. However, it introduces some restrictions on the distribu- 
tions of inliers and outliers. Some of these restrictions have analogs in other works (see e.g., §2.21) . 
while others are unique to this framework. We clarify them later on. 



1.1 Previous Work 



Many algorithms (or pure estimators) have been pro posed for r obust s ubspace estimation or equiv- 
alently robust low rank approximation of matrices. iMaronnal (119761) . iHuber and Ronchetti (2009, 



1 . This is a common assumption to reduce the co mplexi ty of the subspace recove ry problem dCandes et all 1201 ll ; 
IXu et al.Ll2010b1 . l2012l ; lMcCov and Tropcll201lh . where iMcCov and Tropd d201ll) suggest centering by the geomet- 
ric median. Nevertheless, our methods easily adapt to affine subspace fitting by simultaneously estimating both the 
offset and the shifted linear component, but the justification is a bit more complicated then. 
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8).|Pevlinetal.ldl98lh . baviesl dl987l) . Ku and Yuilld d 19951) . ICroux and Haesbroeckl d2000h and 



Maronna et al 



d200d. ;6) estima te a robust covariance matrix. Some of these methods use M- 



estimators ( Maronna et al, . 2006L §6) and co mpute t hem v ia iteratively re- weighted least squares 



(IRLS) algorithms, which linearly converge (lArslanl . 120041) . The convergence of algorithms based 
on other estimators or strategies is not as satisfying. The objective function s of the MCD (Minimum 
Covariance Determinant) and S -estimators converge (IMaronna et al.l . l2006l . §6), but no convergence 
rates are specified. Moreover, there are no guarantees for the actual convergence to the global op- 
timum of these objective functions. Th ere is no good algorit hm for the MVE (Minimum Volume 



Ellipsoid) or Stahel-Donoho estimators (IMaronna et all 120061. %). Furthermore, convergence anal 



ysis is problematic for the online algorithm of IXu and Yuilld (11995b . 

_ Li and Chenl d 19851 ) JAmmannl d 19931 ). ICroux et all d2007l ). lKwakl d2008l) and lMcCov and Troppl 



referred 
p. 204). 



(1201 11 . §2) find low-dimensional projections by "Projection P ursuit" (PP), now common! 
to as PP-PCA (the initial proposal is due to Huber, see e.g., iHuber and Ronchettil . 12009 
The PP-PCA procedure is based on the observation that PCA maximizes the projective variance 
and can be implemented incrementally by computing the residual principal component or vector 
each time. Consequently, PP-PCA replaces this variance by a more robust function in this incre- 
mental implementation. Most PP-based methods are based on non-c onvex optimization and con- 
sequently lack satisfying guarant ees. In particu lar, ICroux et al.1 (|2007l ) do not analyze convergence 
of their non-convex PP-P CA and iKwaH d2008l) only establishes convergence to a local maximum. 
McCoy and Troppl ( 2011 . §2) suggest a convex relaxation for PP-PCA. However, they do not guar- 



antee that the output of their algorithm coincides with the ex act maxim i zer of their energy (though 
they show that the energies of the two are sufficiently close). lAmmannl d 19931) applies a minimiza- 
tion on the sphere, which is clearly not convex. It iteratively tries to locate vectors spanning the 
orthogonal complement of the underlying subspace, i.e., D — d vectors for a subspace in G(D, d). 
We remark that our method also suggests an optimization revealing the orthogonal compl ement, but 



it requ ires a single convex optimization, which is completely different from the method of lAmmann 
dl993l) . 



Torre and Blackl d200ll. liooih . iBrubaken d2009l) and Ku et al.l d2010al) remove possible outliers, 



followed by e stimation of the u nderlying subspace by PCA. These methods are highly non-convex. 
Nevertheless, Ku et al.1 d2010al) provide a probabilistic analysis for their near recovery of the under- 
lying subspace. 

The non-convex minimization of (f2]) as a robust alternative for p rincipal component analysis 



was suggested earlier by various autho r s for hyperplane modeling dOsborne and Watsonl. 1 1985 



Spath and Watson . 1987 : Nyquist . 1988 : Bargiela and Hartley . 1993 ). surface model ing dWatson 



2001 LI: 
2009I) . 



2001 . 20021 ). subspace modeling dDing et all 120061) and multiple subspaces modeling dZhang et al 



20091) . This minimizatio n also appeared in a pure g eometric-analytic context of gen eral surface 
modeling without outliers dDavid and Semmesl.ll99ll ). iLerman and Zhangld201(ll . 120111 ) have shown 
that this minimization can be robust to outliers under some conditions on the sampling of the data. 



Ke and Kanadd (|2003r) tried to minimize (over all low-rank approximati ons) the element-wise 



U nor m of the difference of a g iven matrix and its low -rank approximation. IChandrasekaran et al. 



d2009j) and lCandes et al.1 d201 ll) have proposed to minimize a linear combination of such an l\ norm 
and the nuclear norm of the low-rank approximation in order to find the optimal low-rank estima- 



tor. 



Candes et al.l (1201 ll) considered the setting where uniformly sampled elements of the low-rank 



matrix are corrupt ed, which does not apply to o ur outlier model (where only some of the rows are 



totally corrupted). IChandrasekaran et al 



d2009h consider a general setting, though their underlying 
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condition is too restrictive; weaker con dition was suggested by iHsu et all (120 lib, though it is still 
not sufficiently general. Nevertheless, IChandrasekaran et all (120091) and ICandes et al.1 (1201 11) are 
ground-breaking to the whole area, since they provide rigorous analysis of exact low-rank recovery 

wit h unspecified ran k. 

Xu et al ( 2010b ) and McCoy and Troppl ( 2011 ) have suggested a strategy analogous to Chandrasekaran et al 



(120091) and lCandes et all (1201 ll) to solve the outlier problem. They divide the matrix X whose rows 
are the data points as follows: X = L + O, where L is low-rank and O represents outliers (so that 
only some of its rows are non-zero). They minimize ||L||* + A||0||( 2) i), where || • ||* and || • ||(2,i) 
denote the nuclear norm and sum of Z 2 norms of rows respectively and A is a parameter that needs 
to be carefully chosen. We note that the term ||0||( 2 ,i) is analogous to ©. IXu et al.l (120121) have 
established an impressive theory showing that under some incoherency conditions, a bound on the 
fraction of o utliers and correct cho ice of the parameter A, they can exactly recover the low-rank ap- 



proximation. iHsu et all (1201 lh andlAg arwal et a 



as well as for the ones of 



d2011ah improved error boun ds for this estimator 



Chandrasekaran et al.l (120091) andlCandes et all (1201 If). 



In p racti ce, the implementations by IChandrasekaran et al.1 (|2009l) . ICandes et al.l (1201 lh.lXu et al. 
( 2010b ) and McCoy and Tropp ( 2011 ) use the iterative procedure described by Lin et al. ( 2009b . 



The difference between the regularized objecti ve functions of the minimizer and its estimator ob- 
tained at the feth iteration is of order 0(k~ 2 ) dLin et al. l|2009|, Theorem 2.1). On the other hand, 
for our algorithm the convergence rate is of order 0(exp(— ck)) for some constant c (i.e., it linearly 
converges). This rate is the order of the Frobenius norm of the difference between the minimizer 
sought by our algorithm (formulated in © below) and its estimator obtained at the /cth iteration 
(it is also the order of the differ ence of the regularized objective functions of these two matrices). 
Recently, lAgarwal et all (1201 1 m showed that projected gradient descent algorithms for these esti- 
mators obtain linear convergence rates, though with an ad ditional statistical error. 

Our numerical algorithm can be categor ized as I RLS. IWeiszfeldl (119371) used a procedure sim- 
ilar to ours to find the geometric median. Lawsonl later used it to solve uniform approximation 
problems by the limits of weighted / p -norm solutions. This proced ure was generalized to vari- 
ous minimization pro blems, in particular, it is native to M-estimators (|Huber and Ronchettil. 12009: 
Maronna et all I2006Q. and its linear convergence was p roved for special instances (see e.g.,|Cline, 
19721 : IVoss and Eckhardl Il980l : Ichan and Muletl.ll999h. Recently IRL S algorithms we re also ap- 
plied to sparse recovery and matrix completion ( Daubechies et al , 2010l : Fornasier et al. 



1.2 This Work 

We suggest another convex relaxation of the minimization of ©. We note that the original mini- 
mization is over all subspaces L or equivalently all orthogonal projectors P = P L ± . We can identify 
P with a D x D matrix satisfying P 2 = P and P T = P (where - T denotes the transpose). Since 
the latter set is not convex, we relax it to include all symmetric matrices, but avoid singularities by 
enforcing unit trace. That is, we minimize over the set: 



:={QG 



DxD 



:Q = Q T ,tr(Q) = l} 



(3) 



as follows 



N 



Q = argminF(Q), whereF(Q) := ||Qxj 

QeH 



(4) 
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For the noiseless case (i.e., inliers lie exactly on L*), we estimate the subspace L* by 

L:=ker(Q). (5) 

If the intrinsic dimension d is known (or can be estimate from the data), we estimate the subspace 
by the span of the bottom d eigenvectors of Q. This procedure is robust to sufficiently small levels 
of noise. We refer to it as the Geometric Median Subspace (GMS) algorithm and summarize it in 
Algorithm Q] We elaborate on this scheme throughout the paper, 

Algorithm 1 The Geometric Median Subspace Algorithm 

Input: X = {xj}^ 1 C ]R D : data, d: dimension of L*, an algorithm for minimizing (01) 
Output: L: a <i-dimensional linear subspace in M. D . 
Steps: 

• { v i}f=i = tne bottom d eigenvectors of Q (see (@])) 

.L = Sp({v,}f =1 ) 

We remark that Q is semi-definite positive (we verify this later in Lemma [16]). We can thus 
restrict H to contain only semi-definite positive matrices and thus make it even closer to a set of 
orthogonal projectors. Theoretically, it makes sense to require that the trace of the matrices in EI 
is D — d (since they are relaxed versions of projectors onto the orthogonal complement of a d- 
dimensional subspace). However, scaling of the trace in © results in scaling the minimizer of (0]) 
by a constant, which does not effect the subspace recovery procedure. 

We note that © is an M-estimator with residuals = ||QXj||, 1 < i < N, and p(x) = \x\. 
Unlike ©, which can also be seen as a formal M-estimator, the estimator Q is unique under a weak 
condition that we will state later. 

We are unaware of similar formulatio ns for the problem of robust PCA. Nevertheless, the Low- 
Rank Representation (LRR) framework of Liu et al. ( 2010l) for modeling data by multiple subspaces 



(and not a single subspace as in here) is formally similar. LRR tries to assign to a data matrix X, 
which is viewed as a dictionary of N column vectors in R D , dictionary coefficients Z by minimizing 
A||Z||* + ||(X(I — Z)) T ||( 2j i) over all Z £ M NxN , where A is a free parameter. Our formulation 
can be obtained by their formulation with A = 0, Q = (I — Z) T and the additional constraint 
tr(Z) = D — 1 (which is equivalent with the scaling tr(Q) = 1), where {x.i}f =l are the row 
vectors of X (and not the column vectors that represent the original data points). In fact, our work 
provides some intuition for LRR as robust recovery of the low rank row space of the data matrix 
and its use (via Z) in partitioning the column space into multiple subspaces. We also remark that 
a trace 1 constraint is quite natu ral in convex relaxation problems and was applied, e.g., in the 



convex relaxation of sparse PCA (Id'Aspremont et all, 120071) . though the optimization problem there 
is completely different. 

Our formulation is rather simple and intuitive, but results in the following fundamental contri- 
butions to robust recovery of subspaces: 

1. We prove that our proposed minimization can achieve exact recovery under some assump- 
tions on the underlying data (which we clarify below) and without introducing an additional 
parameter. 

2. We propose a fast iterative algorithm for achieving this minimization and prove its linear 
convergence. 
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3. We demonstrate the state-of-the-art accuracy and speed of our algorithm when compared with 
other methods on both synthetic and real data sets. 

4. We establish the robustness of our method to noise and to a common regularization of IRLS 
algorithms. 

5. We explain how to incorporate knowledge of the intrinsic dimension and also how to estimate 
it empirically. 

6. We show that when replacing the sum of norms in (@]) by the sum of squares of norms, then 
the modified minimizer Q is a scaled version of the empirical inverse covariance. The sub- 
space spanned by the bottom d eigenvectors is clearly the ci-dimensional subspace obtained by 
PCA. The original minimizer of (@]) can thus be interpreted as a robust version of the inverse 
covariance matrix. 



7. 



We show that previous and w ell-known M-estimators ( Maronnal 1976 : Huber and Ronchetti . 
20091 : iMaronna et all . 120061) do not solve the subspace recovery problem under a common 



assumption. 

At last, let us clarify on a non-technical level our assumptions on the data (technical discussion 
appears in $2). The exact recovery problem assumes a fixed ci-dimensional linear subspace L*, 
inliers sampled from L* and outliers sampled from its complement; it asks to recover L* as well 
as identify correctly inliers and outliers. We basically require three kinds of restrictions on the 
underlying data. First of all, the inliers need to permeate through the whole underlying subspace L*, 
in particular, they cannot concentrate on a lower dimensional subspace of L*. Second of all, outliers 
need to permeate throughout the whole complement of L*. This assumption is rather restrictive. 
We thus show that it is not needed when d is known. We also suggest some practical methods to 
avoid it when d is unknown. Third of all, the "magnitude' of outliers needs to be restricted. We 
may initially scale all points to the unit sphere in order to avoid extremely large outliers. However, 
we still need to avoid outliers concentrating along lines, which may have an equivalent effect of a 
single arbitrarily large outlier. Figure Q] (which appears later in © demonstrates cases where these 
assumptions are not satisfied. 



1.3 Structure of This Paper 

In $2]we establish exact and near subspace recovery via the GMS algorithm or (f5]). We also carefully 
explain the common obs tacles for robust subspace recovery and the way t hey are handled by pre- 
vious rigorous solutions ( Candes et al. , 2011 : Chandrasekaran et al. , 20091 : Xu et al. , 2012 ) as well 
as our solution. Section [3] emphasizes the significance of o ur M-estimator in two different ways . 
First of all, it shows that a w ell-known class of M-estimators ( Maronna . 19761 : Huber and Ronchetti . 
20091 : IMaronna et al.LEo06h cannot solve the subspace recovery problem (under a common assump- 
tion). Second of all, it shows that non-robust adaptation of our M-estimator provides both direct 
estimation of the inverse covariance matrix as well as convex minimization equivalent to the non- 
convex total least squares (this part requires full rank data and thus a possible initial dimensionality 
reduction but without any loss of information). We thus interpret © as a robust estimation of the 
inverse covariance. In <@] we propose an IRLS algorithm for minimizing (@]) and establish its linear 
convergence. Section [5] discusses practical versions of the GMS procedure or (f5]) that allow more 
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general distributions than the ones guaranteed by the theory. One of these versions, the Extended 
GMS (EGMS) even provides robust alternative to principal components. In ^6] we demonstrate the 
state-of-the-art accuracy and speed of our algorithm when compared with other methods on both 
synthetic and real data sets and also numerically clarify some earlier claims. Section |7] provides all 
details of the proofs and ^concludes with brief discussion. 



2. Exact and Near Subspace Recovery by the Minimization © 
2.1 Problem Formulation 

For the exact recovery problem, we assume a linear subspace L* G G(D,d) and a data set X = 
{xj }fL 1 , which contains inliers sampled from L* and outliers sampled from R D \ L* . Given the data 
set X and no other information, the objective of this problem is to exactly recover the underlying 
subspace L* . In order to make the problem well-defined, one needs to assume some conditions on 
the sampled data set, which may vary with the proposed solution. We emphasize that this is a for- 
mal mathematical problem, which excludes some ambiguous scenarios and allows us to determine 
admissible distributions of inliers and outliers. 

In the noisy case (where inliers do not lie on L*, but perturbed by noise), we ask about near 
subspace recovery, i.e., recovery up to an error depending on the underlying noise level. We argue 
below that in this case additional information on the model is needed. Here we assume the knowl- 
edge of d, though under some assumptions we can estimate d from the data (as we demonstrate 
later). We remark that exact asymptotic recovery under some conditions on the noise distri bution is 
way more complicated and will be discussed in another work ( Coudron and Lerman . 2012|) . 



2.2 Common Difficulties with Subspace Recovery 

We introduce here three typical enemies of subspace recovery and exemplify them i n Figured] We 

also explain how they are handled by the pr evious convex solutions for exact recovery (|Chandrasekaran et al 



20091 : ICandes et all 1201 ll i lXu et all l2012h . 



A type 1 enemy occurs when the inliers are mainly sampled from a subspace L' C L*. In this 
case, it seems impossible to recover L* . We would expect a good algorithm to recover L' (instead 
of L*) or a subspace contai n ing it with slightly high er dimension (s ee for example Figure [T(a)| ). 
Chandrasekaran et al.1 (120091) . ICandes et all (1201 lb and Ku et all (120121) have addressed this issue by 
requiring incoherence conditions for the inliers. For example, if m a nd N — m point s are sampled 
from L' and L* \ L' respectively, then the incoherency condition of Ku et al.1 (120121) requires that 
fJ' > A r /(dim(L*) • (N — m)), where [i is their incoherency parameter. That is, their theory holds 
only when the fraction of points sampled from L* \ L' is sufficiently large. 

A type 2 enemy occurs when the outliers are mainly sampled from a subspace L such that 
dim(L © L*) < D. In this case L* © L can be mistakenly identified as the low-rank subspace 
(see for example Figure |l(b)[ ). This is a main issue when the intrinsic dimension is unknown; 
if on the other hand the intrinsic dimension is known, then one can often overcome this enemy. 
Candes et al.l (1201 ll) handle it by assuming that the distribution of corrupted elements is uniform. 
Chandrasekaran et al.1 (|2009r) address it by restricting their parameter \x (see their main condition, 
which is used in Theorem 2 of their work, and their definition of /i in (1.2) of their work) and 
consequently li mit the values of the mixture parameter (denoted here by A). On the other hand, 
Xu et all (120 12l) use the true percentage of outliers to infer the right choice of the mixture parameter 
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(a) Example of a type 1 enemy: L* is a plane repre- 
sented by a rectangle, "inliers" (in L*) are colored blue 
and "outliers" (in R 3 \ L*) red. Most inliers lie on a line 
inside L* . It seems unlikely to distinguish between in- 
liers, which are not on "the main line", and the outliers. 
It is thus likely to recover the main line instead of L* . 



(b) Example of a type 2 enemy: L* is a line represented 
by a black line segment, "inliers" (in L*) are colored 
blue and "outliers" (in R 3 \ L*) red. All outliers but one 
lie within a plane containing L*, which is represented 
by a dashed rectangle. There seems to be stronger dis- 
tinction between the points on this plane and the isolated 
outlier than the original inliers and outliers. Therefore, 
an exact recovery algorithm may output this plane in- 
stead of L*. 



-0.8 -0.6 -0.4 -0.2 0.2 0.4 0.6 0.8 



(c) Example 1 of a type 3 enemy: The inliers (in blue) lie 
on the line L* and there is a single outlier (in red) with 
relatively large magnitude. An exact recovery algorithm 
can output the line L (determined by the outlier) instead 
of L*. If the data is normalized to the unit circle, then 
any reasonable robust subspace recovery algorithm can 
still recover L* . 



(d) Example 2 of a type 3 enemy: Points are normalized 
to lie on the unit circle, inliers (in blue) lie around the 
line L* and outliers (in red) concentrate around another 
line, L. A subspace recovery algorithm can output L 
instead of L*. 



Figure 1 : Enemies of the mathematical formulation of exact subspace recovery. 
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A. That is, they practically invoke model selection (for estimating this percentage) in order to reject 
L © L* and choose the true model, which is L* . 

A type 3 enemy occurs due to large magnitudes of outliers. For example, a single outlier with 
arbitrarily large magnitude will be contained in the minimizer of (O, which will thus be different 



than the underlying subspace (see for example Figure fl(c)] ). Also, many outliers with not-so-small 
magnitudes that lie around a fixed line ma y have the effect of a single la r ge outlier (see for ex am- 
ple Figur e |l(d) |). This enemy is avoided by IChandrasekaran et al. d2009h . Icandes et all d201 lh and 
Xu et al.l (120 120 by the additional mixture component of nuclear norm, which penalizes the magni- 
tude (or combined magnitude) of the supposed inliers (so that outliers of large magnitude may not 
be easily identified as inliers). It is interesting to note that if the rank is used instead of the nuclear 
norm (as sometimes advocated), then it will not resolve this issue. 

Another issue for our mathematical problem of exact subspace recovery is whether the subspace 
obtained by a proposed algorithm is unique. Many of the convex algorithms depend on convex l\- 
type methods that may not be strictly convex. But it may still happen that in the setting of pure 
inliers and outliers and under some conditions avoiding the three types of enemies, the recovered 
subspace is uniqu e (even though it may be ob t ained by several no n -unique minimiz ers). This is 
indeed the case in Chandrasek aran et al. d2009h . ICandes et al.l (1201 ll) . IXu et al.l (120121) and our own 
work. Nevertheless, uniqueness of our minimizer (and not the recovered subspace) is important 
for analyzing the numerical algorithm approximating it and for perturbation analysis (e.g., when 
considering near recovery with noisy data). It is also helpful for practically verifying the conditions 
we will propose for exact recovery. Uniquen ess o f the minimizer (an d not just the subspace) is 
also important in IChandrasekaran et all (|2009T) and ICandes et al.l (|201 ll) and they thus established 
conditions for it. 

At last, we comment that subspace recovery with unknown intrinsic dimension may require 
a model selection procedure (possibly implicitly). That is, even though one can provide a theory 
for exact subspace recovery (under some conditions), which might be stable to perturbations, in 
practice, so me form of model selection w ill b e necessary in no isy cases. For example, the impressive 
theories by Chandrasekaran etal (2009) and IXu et al.l (120121) require the estimation of the mixture 
parameter A. Xu et al. ( 20121) propose such an estimate for A, which is based on knowledge of the 
data set (e.g., the distribution of corruptions and the fraction of outliers). However, we noticed that 
in practice this proposal did not work well (even for simple synthetic examples), partly due to the 
fact that the deduced conditio ns are only sufficien t, not necessary and there is much room left for 
improvement. The theory by ICandes et al.l (|201ll) specified a choice for A that is independent of 
the model parameters, but it applies only for the special case of uniform corruption without noise; 
moreover, they noticed that other values of A could achieve better results. 



2.3 Conditions for Handling the Three Enemies 

We introduce additional assumptions on the data to address the three types of enemies. We denote 
the sets of exact inliers and outliers by X\ and Xq respectively, i.e., X\ =<YnL* and Xq = X \ L*. 
The following two conditions simultaneously address both type 1 and type 3 enemies: 



mm 
QeH,QP T * 



mm 
QeH,QP T „ 



=0 



=0 



E iiQ*ii 
E HQ*i 



> \pl min 

> \pl max 

veL*,||v|| 



T i 
V X 



, E |v T x|- 



(6) 



(7) 
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A lower bound on the common LHS of both © and © is designed to avoid type 1 enemies. 
This common LHS is a weak version of the permeance statistics, which was defined in (3.1) of 



Lerman et al.1 (|2012h as follows: 



V(L*):= mm £ |(u,x)|. 



ueL* 
||u||=i xgA"i 



Similarly to the permeance statistics it is zero if and only if all inliers are contained in a proper 
subspace of L* . Indeed, if all inliers lie in a subspace L' C L* , then this common LHS is zero with 
the minimizer Q = Pi/^nL* / tr(P L /x nL * )■ Similarly, if it is zero, then Qx = for any x G X\ and 
for some Q with kernel contained in L . This is only possible when X\ is contained in a proper 
subspace of L*. Similarly to the permeance statistics, if the inliers nicely permeate through L*, then 
this common LHS clearly obtain large values. 

The upper bounds on the RHS's of © and © address two complementing type 3 enemies. If 
Xq contains few data points of large magnitude, which are orthogonal to L* , then the RHS of © 
may be too large and © may not hold. If on the other hand Xq contains few data points with large 
magnitude and a small angle with L* , then the RHS of © will be large so that © may not hold. 
Conditions © and © thus complete each other. 

Th e RHS of condition © is similar to the linear structure statistics (for L*), which was defined in 
(3.3) of lLerman et al The linear structure statistics uses an I2 average of dot pr oducts instead 



of the l\ average used here and was applied in this context to ]R (instead of L*) in Lerman et al. 



(|2012r) . Similarly to the linear structure statistics, the RHS of © is large when outliers either have 
large magnitude or they lie close to a line (so that their combined contribution is similar to an outlier 
with a very large magnitude as exemplified in Figure \T(d)\ . The RHS of condition © is a very weak 
analog of the linear structure statics of L* ± since it uses a minimum instead of a maximum. There 
are some significant outliers within L* -1 that will not be avoided by requiring ©. For example, if 
the codimension of L* is larger than 1 and there is a single outlier with an arbitrary large magnitude 
orthogonal to L*, then the RHS of © is zero. 

The next condition avoids type 2 enemies and also significant outliers within L* -1 (i.e., type 3 
enemies) that were not avoided by condition ©. This condition requires that any minimizer of the 
following oracle problem 

Q := argmin F(Q) (8) 

QeH,QP L »=0 

satisfies 

rank(Qo) = D — d. (9) 

We note that the requirement QPl* = is equivalent to the condition ker(Q) 2 L* and therefore 
the rank of the minimizer is at most D — d. Enforcing the rank of the minimizer to be exactly D — d 
restricts the distribution of the projection of X onto L* ± . In particular, it avoids its concentration on 
lower dimensional subspaces and is thus suitable to avoid type 2 enemies. Indeed, if all outliers are 
sampled from L C L*- 1 , then any Qel with ker(Q) D L + L* satisfies F(Q) = and therefore 
it is a minimizer of the oracle problem ©, but it contradicts d9]). 

We note that this condition also avoids some type 3 enemies, which were not handled by condi- 
tions © and ©. For example, any D — d — 1 outliers with large magnitude orthogonal to L* will 
not be excluded by requiring © or ©, but will be avoided by d9]). 

This condition is restrictive though, especially in very high ambient dimensions. Indeed, it does 
not hold when the number of outliers is smaller than D — d (since then the outliers are sampled 
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from some L with dim(L © L*) < D). We thus explain in £]5.2l and £]5.3l how to avoid this condition 
when knowing the dimension. We also suggest in £15.11 some practical solutions to overcome the 
corresponding restrictive lower bound on the number of outliers when the dimension is unknown. 

Example 1 We demonstrate the violation of the conditions above for the examples depicted in Fig- 
ure^ The actual calculations rely on ideas explained in £]2.6I 

For the example in Figure \l(a)\ which represents a type 1 enemy, both conditions © and ([7]) 
are violated. Indeed, the common LHS of © and © is 5.69, whereas the RHS of © is 8.57 and 
the RHS of © is larger than 10.02 (this lower bound is obtained by substituting v = [0, 1, 0] in the 
RHS of ([7]); note that v is a unit vector in L*). 

For the example in Figure \l(b)\ which represents a type 2 enemy, condition © is violated. 
Indeed, we obtained numerically a solution Qo with rank(Qo) = \^D — d = 2 (one can actually 
prove in this case that Qo is the projector onto the orthogonal complement of the plane represented 
by the dashed rectangle). 

For the example in Figure 1(c) which represents a type 3 enemy, both conditions © and © are 



violated. Indeed, the common LHS of © and © is 1.56 and the RHS's of © and © are 5.66 and 
4.24 respectively. However, if we normalize all points to lie on the unit circle, then this enemy can 
be overcome. Indeed, for the normalized data, the common LHS of © and © is 6 and the RHS's 
of © and © are 1.13 and 0.85 respectively. 

For the example in Figure \l(d)\ which also represents a type 3 enemy, both conditions © and 
© are violated. Indeed, the LHS of © and © are 5.99 and the RHS's of © and © are 6.91 and 
7.02 respectively. 

2.4 Exact Recovery Under Combinatorial Conditions 

We show that the minimizer of © solves the exact recovery problem under the above combinatorial 
conditions. 

Theorem 1 Assume that d, D G N, d < D, X is a data set in W D and L* € G(D, d). If conditions 
©, © and © hold (w.r.t. X and L*J, then any minimizer of ©, Q, recovers the subspace L* in 
the following way: ker(Q) = L*. If only © and © hold, then ker(Q) D L*. 



2.5 Uniqueness of the Minimizer 

We recall that Theorem Q] implies that if ©, © and © hold, then ker(Q) is unique. Here we 
guarantee the uniqueness of Q (which is required in §2.61 £12.81 £12.91 and £14.21 ) independently of the 
exact subspace recovery problem. 

Theorem 2 If the following condition holds: 

{X n Li} U {X n L 2 } / X for all (D - 1) -dimensional subspaces Li,L 2 C M D , (10) 
then -F(Q) is a strictly convex function on EL 



2.6 Variants and Verifiability of Conditions ©, © and © 

We discuss the possibility of verifying conditions ©, © and ©, or variants of them, given L*. We 
can find a solution of © by Algorithm |2l which is developed later in £]4.3I for solving these types 
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of problems. We cannot find though all solutions of ([8]). Nevertheless, if we also assume (TT0T >. then 
this solution is unique. 

The LHS's of © and © can be calculated by applying Algorithm |2] to the data set {Pl*x : 
x G X m }, where P denotes the D x d projector corresponding to the D x D projector P onto a 
d-dimensional subspace, so that P = PP T . 

The RHS's of © and © cannot be easily calculated (unless D — d = 1 for © and d = 1 for 
© as was often the case in Example [T]). Therefore, we propose here two variants of © and © that 
can be easily verified. The first variant suggests stronger (i.e., more restrictive) conditions, which 
obviously imply © and ©: 



mm 

QeH,QP T „ 



IIQx|| > y/2 ]T ||P L .xx| 



(ID 



xeA"i 



and 



min V ||Qx|| > y/2 V ||P L *x||. 
qgh,qp l , ± =0. 



(12) 



xeA"i 



xeA'o 



The second variant suggests the following weaker conditions (i.e., less restrictive) that can still 
be used to imply exact recovery (as we show in £)7.3I ). For an arbitrarily chosen solution of ©, Qq: 



mm V ||Qx|| > y/2 
Q e H,QP L , ± =O x ^ i 



Q xx T P L »x/||Q x| 



(13) 



and 



mm V ||Qx|| > y/2 

QeH,QP Lt± =O x ^ i 



Q xx T P L ,/||Q x| 

xeA-o 



(14) 



There is one problem with these conditions: the arbitrary solution Qo of ® needs to satisfy 
Qo x 7^ for all x G Xq (otherwise the RHS's of ( fT3l ) and (fT4l are undefined). This problem 
is resolved when assuming condition © for exact recovery. Conditions (fT3l and (fT4l) are easy to 
check, since a solution Q can be found by Algorithm [2] 

At last, we remark that we can also obtain the following necessary conditions for the recovery 
of L* as ker(Q) (see comment at the end of §7.31 regarding their justification). For an arbitrarily 
chosen solution of ©, Qq: 



min V ||Qx|| > || V Q xx T P L „x/||Qox| 

Q6H,QP T< , X =0_ 



(15) 



xeAi 



xGAi 



and 



IIQ(Pl*x)|| > y (q,P^ ± Q xx t P l ./||Q x||) foranyQG 



o(D-d)xd 



(16) 



where for matrices A, B G M. kxl : (A, B}^ = tv(AB T ) is the Frobenius dot product. The com- 
ments above regarding conditions (fT3T > and ([141 also apply to (IT5b and ([TBI ). 



12 



A Novel M-Estimator for Robust PCA 



2.7 Exact Recovery under Probabilistic Models 

We show that our conditions for exact recovery (or partial exact recovery) and for uniqueness of the 
minimizer Q hold with high probability under two probabilistic models. 

First we assume a more general probabilistic model. We say that fj, on W D is an Outliers-Inliers 
Mixture (OIM) measure (w.r.t. the fixed subspace L* G G(D,d)) if /i = ao/^o + «iA*i» where 
oq, ai > 0, a> + «! = 1, ii\ is a sub-Gaussian probability measure and fi is a sub-Gaussian 
probability measure on M D (representing outliers) that can be decomposed to a product of two 
independent measures fiQ = fio.L* x Ho l* x sucn that the supports of fio.L* and /i L »x are L* and 
L* 1 - respectively, and /i L *± is spherically symmetric with respect to rotations within L* J 



To provide cleaner probabilistic estimates, we also invoke the needle-haystack model of lLerman et al 
(|2012r ). It assumes that both /io and fit are the Gaussian distributions: //q = N(0, a^I/D) and 



[i\ = N(0, £j{Pl*Pl*/^) (the factors 1/D and 1/d no rmalize the magnitud e of outliers and inliers 



respectively so that their norms are comparable). While lLerman et al.1 ((20121 ) assume a fixed number 
of outliers and inliers independently sampled from /j,q and respectively, here we independently 
sample from the mixture measure p, = a>op,o + "l/^i; we refer to \i as a needle-haystack mixture 
measure. 

In order to prove exact recovery under any of these models, one needs to restrict the fraction 
of inliers per outliers (or equivalently, the ratio ct\jao). We refer to this ratio as SNR (signal 
to noise ratio) since we may view the inliers as the pure signal and the outliers as some sort of 
"n oise". For the needle -haystack model we require the following SNR, which is similar to the one 



of iLerman et all (|2012h : 

a\ , on d 

— > 4— - (17) 
«o 0"i y/(D - d)D 

We later explain how to get rid of the term oi/oq. For the OIM model we assume the following 
more general condition: 

a x min f ||Qx|| d/ii(x) > 2V2-p— [ ||P l .j.x|| d/i (x). (18) 

Q6H,QP Lt± =0J L> - d J 

Under the needle-haystack model, this condition is a weaker version of (fT71) . That is, 

Lemma 3 If fi is a needle-haystack mixture measure, then (1171 ) implies (1181 ). 

For i.i.d. samples from an OIM measure satisfying (fT8T ). we can establish our modified con- 
ditions of unique exact recovery (i.e., ([131 , (fl4l ) and (O) with overwhelming probability in the 
following way (we also guarantee the uniqueness of the minimizer Q). 

Theorem 4 If X is an i.i.d. sample from an OIM measure fi satisfying (118 b . then conditions (113 b . 
(|14l) . and © hold with probability 1 — C exp(—N/C), where C is a constant depending on f/, and 
its parameters. Moreover, (1101 ) holds with probability 1 if there are at least 2D — 1 outliers (i.e., the 
number of points in X \ L* is at least 2D — 1). 

Under the needle-haystack model, the SNR established by Theorem[4]is comparable to the best 
SNR among other convex exact recovery algorithms (this is later clarified in Table []]). However, the 
probabilistic estimate under which this SNR holds is rather loose and thus its underlying constant 
C is not specified. Indeed, the proof of Theorem [4] uses e-nets and union-bounds arguments, which 
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are often not useful for deriving tight probabilistic estimates (see e.g. jMendelsonl (120031 . page 18)). 
One may thus view Theorem [4] as a near- asymptotic statement. 

The statement of Theorem |4] does not contradict our previous observation that the number of 
outliers should be larger than at least D — d. Indeed, the constant C is sufficiently large so that the 
corresponding probability is negative when the number of outliers is smaller than D — d. 

In the next theorem we assume only a needle-haystack model and thus we can provide a stronger 
proba bilistic estima t e base d on the concentration of measure phenomenon (our proof follows di- 
rectly |LermanetalJ (120121) ). However, the SNR is worse than the one in Theorem [4] by a factor of 
order y/D — d. This is because we are unable to estimate Qo of © by concentration of measure. 
Similarly, in this theorem we do not estimate the probability of © (which also involves Qo). Never- 
theless, we observed in experiments that © holds with high probability for Nq = 2(D — d) and the 
probability seems to go to 1 as No = 2(D — d) and D — d — > oo. Moreover, one of the algorithms 
proposed below (EGMS) does not require condition ©. 

Theorem 5 If X is an Ltd. sample of size N from a needle-haystack mixture measure fi and if 



a 



'2/7T- 1/4 



o-i ^/2Jtt + 1/4 + 1/10 V D 




and 



N > 64 max(2d/ax,2d/a ,2(D -d)/a ), 
then © and © hold with probability 1 - e~ a i N / 2 - 2e~ a o N / 2 - e -<*iN/aao 



(19) 



(20) 



-a N/800 



In Table Q] we present the theoretical asymptotic SNRs for exact recovery of some recent algo- 
rithms. We assu me the needle-haystack model with fixed d, D, a , a±, o"o and o~\ and N — > oo. 
Xu et all (boiOal) established the SNR for their High-dimensional Robust PCA (HR-PCA) algo- 
rithm in Remark 3 of their work. IXu et al.1 (|2012h established the SN R for their Outlier Pursuit ( OP) 
algorithm (equivalently the Low-Leverage Decomposition (LLD) of iMcCoy and Troppl (120111) ) in 
Theorem 1 of their work. Their analysis assumes a deterministic condition, but it is possible to show 
that this condition is asymptotically valid under the needle-haystack model. iLerman et al.l (120 120 
established w.h.p. the SNR of the REAPER algorith m in Theorem 1 of their work (for simplicity of 
their expressions they assumed that d < (D — 1) /2). IZhangl (120121) established the SNR for Tyler's 
M-Estimator (TME) in Theorem 1 of his work. His result is deterministic, but it is easy to show that 
his weak deterministic condition holds with probability 1 under the needle-haystack model. 



Table 1 : Theoretical SNR (lowest bound onai/ ao) for exact recovery when N — >• oo 



HR-PCA 



LLD (OP) 



L := ker(Q) 



REAPER (d < (D — l)/2) 



TME 



a a 



ai > 
a — 



121d 
9 



«0 <n yJ(D-d)D 



cti ^ (to / /"< d 



«■() 



HI > 
a 



D-d 



The asymptotic SNR of the minimization proposed in this paper is of the same order as that of 
the REAPER algorithm (which was established for d < {D — l)/2) and both of them are clearly 
better than the HR-PCA algorithm. Both OP and TME are independent of o\ and gq. However, by 
normalizing all data points t o the unit sphe r e, we may assume that o\ = o§ in all other algorithms 
and treat them equally (see ILerman et al.1 (|2012l) ). In this case, the SNR of OP is significantly 
worse than that of the minimization proposed in here, especially when d <S D (it is also worse 
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than the weaker SNR specified in (fl9V). When d < D, the SNR of TME is in the same order 
of the asymptotic SNR of our formulation. However, when d is very close to D, the SNR of our 
formulation is better than the SNR of TME by a factor of \f~D. We question whether a better 
asymptotic rate than the one of our algorithm and REAPER can be obtained by a convex algorithm 
for robust subspace recovery. 

We note though there are non-convex methods for removing outliers, whose SNRs are asymp- 
totically zero. Such SNRs are valid only for the noiseless case and may be differently formu- 
lated for detecting the h idden low-dimensional structure among uniform outliers. For example, 
Arias-Castro et all (|2005h proved that the scan statistics may detect points sampled uniformly from 
a d-dimensional graph in R D of an m-differ entiable function among u niform outliers in a cube in 



with SNR of order 0{N d l^ 



■m(D-d))' 



Arias-Castro et al.l ((201 lh used higher order spectral 



clustering affinities to remove outliers and thus detect differentiable surfaces (or certain unions of 
such surfaces) among uniform outliers w ith SNR that is only worse by a logarithmic factor than 
the one established for the scan statistics. ISoltanolkotabi and Candesl (|201lh removed outliers with 
"large dictionary coefficients" and showed that this detection works well for outliers uniform in 



S D -\ inliers uniform in S " 1 n L* and SNR at least -i ■ ((— 



QliV-l 



cD 
d 



1) 1 (where a\ is the 



fraction of inliers) as long as N < ^e Cv ^. Furthermore, Lerman and Zhangl ( 2010 ) showed that 
the global minimizer of (0 (that we relax in this paper so that the minimization is convex) can in 
theory recover the subspace with asymptotically zero SNR. They also showed that the underlying 
subspace is a local minimum of © with SNR of order oj(l/y/N). Nevertheless, these theoretical 
estimates break down in the presence of noise. Indeed, in the noi sy case the SNRs are asymp toti- 
cally positive and depend on the size of noise. In fact, in view of Coudron and Lerrnar] ( 2012 ) we 
may obtain better asymptotic SNRs when the noise is symmetrically distributed with respect to the 
underlying subspace. 



2.8 Near Subspace Recovery for Noisy Samples 

We show that in the case of sufficiently small additive noise (i.e., the inliers do not lie exactly on 
the subspace L* but close to it), the GMS algorithm nearly recovers the underlying subspace. We 
even prove a more general results. Its formulation uses the constants 70 = 7o(^) and 7q = j' (X), 
which will be specified in §7.71 It also uses the following standard notation: If A € ~R kxl , then 
|| A || j? and A|| are the Frobenius and spectral norms of A. 



Theorem 6 Assume that {e^}^ is a set of positive numbers, X 



{xj}^ and X 



{*;} 



N 



are 



two data sets such that ||xj — Xj|| < VI < i < N and X satisfies (110b - Let Fx(Q) and F^(Q) 
denote the corresponding versions of F(Q) w.r.t. the sets X and X and let Q and Q denote their 
respective minimizers. Then 



IQ-QIIf < 



N 



, 2^ej/7o and ||Q-Q||< 
\ i=i 



N 



\ i=i 



(21) 



Moreover, if 'L and L are the subspaces spanned by the bottom d eigenvectors (i.e., with lowest 
d eigenvalues) o/Q and Q respectively and vr>-d is the (D — d)th eigengap ofQ, then 



|p l - p lII^< 



VD-d 



and ||p£ - Pl|| < 



VD-d 



(22) 
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Theorem |6] is a perturbation result in the spirit of the stability analysis by ICandes et al.l (120061) 
and IXu et al.l (120121. Theorem 2). It is also possible to establish exact asymptotic recovery under 
some conditions of the nois e, but it is rather nontrivial and will be established in a future work by 
Coudron and Lermanl ( 2012 ). 

In order to observe that the statement of Theorem [6] is comparable to that of Theorem 2 of 
Xu et all (120121) . we note that asymptotically the bounds on the recovery errors in (|2TI ) and (l22l 
depend only on the mean of {ei}^L 1 and are independent of N. To clarify this point we formulate 
the following proposition. 



Proposition 7 If X is Ltd. sampled from a distribution fj, and 

^(Li) + fJ-(J->2) < 1 f° r any two D — 1-dimensional subspaces Li and L2, 
then there exist constants cq(/i) > and c' (p) > depending on \i such that 



lim inf 

Af-s>oo 



7oW 
N 



> co(fJ>) 



and lim inf 



N 



> c' (/j,) almost surely . 



(23) 



(24) 



We note that if X is a data set whose inliers are exactly on L* and for which (TTb-dTQb hold, 
then Theorem [6] implies that the GMS algorithm nearly recovers L* when given d and a slightly 
perturbed version of X, X. More interestingly, the theorem implies that we may properly estimate 
the dimension of the underlying subspace in this case. Indeed, if Q is a low-rank matrix with 
ker(Q) = L*, the n the (D — d + l)st eigenvalue of Q is 0. Applying the following eigenvalue 
stability inequality jTaolbolll. (1.63)): 



|Ai(A + B)-Ai(A)| < || B I 



(25) 



we obtain that the (D — d+ l)st eigenvalue of Q is smaller than \ l 2 Yli=i e i/7o> an d the (D — d)th 



eigengap of Q is larger than VD-d ^2iL\ £i/lo- This means that when the noise is small and 

the condition in Theorem [TJ holds for X (the noiseless data), then we can estimate the dimension of 
the underlying subspace for X from the number of small eigenvalues. Such dimension estimation 
is demonstrated later in Figure [2] 



2.9 Near Subspace Recovery for Regularized Minimization 

For our practical algorithm it is advantageous to regularize the function F as follows (see Theo- 
rems [12] and [TJ] below) : 



iv 



N 



F S (Q):= Y, IIQ X *H+ E 

i=l,||Qxj||>a j=l,||Qxi||<(5 



— " + - 

25 2 



(26) 



We re mark that other convex algorithms (ICandes et all. 1201 ll ; IXu et all. |2012| ; iMcCoy and Troppl. 
201 lh also regularize their objective function by adding the term <5||X — L — 0|||i. However, their 
proofs are not formulated for this regularization. 

In order to address the regularization in our case and conclude that the GMS algorithm nearly 
recovers L* for the regularized objective function, we adopt a similar perturbation procedure as in 
£12.81 We denote by Q$ and Q the minimizers of Fg(Q) and F(Q) in H respectively. Furthermore, 
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let Lg and L denote the subspaces recovered by the bottom d eigenvectors of Q,5 and Q respectively. 
Using the constants vo-d and 70 of Theorem [6j the difference between the two minimizers and 
subspaces can be controlled as follows. 



Theorem 8 If X is a data set satisfying (1101 ). then 

||Qi-Q||F < y/N8/2 lQ 



and 



< 



2y/N6/2 lQ 

VD-d 



(27) 



(28) 



3. Significance of Our M-Estimator 

3.1 Problems with Exact Recovery for the Common M-estimator 

A com mon and well-known M-estimator ( Maronna , 19761 : Huber and Ronchetti , 20091 : Maronna et al 



2006) tries to recover the 0-centered covariance matrix. The image of the estimated covariance is 
clearly an estimator to the underlying subspace L* . This M-estimator for the covariance matrix is 
the minimizer of the following function over all D x D nonnegative symmetric matrices (for some 
choices of a function p) 



N 



N 



L(A) = ^ /0 (xfA- 1 x,) + -log|A|. 



(29) 



This minimizer can be obtained as a limit of nonnegative symmetric matrices in M. DxD , { A m } me pj, 
as follows: 



N 



A (m+i) = ^ n ( x f A^- 1 ^)^ /N, 



(30) 



i=l 



wh ere u(x) = 2p'( x). 



Kent and Tylen (119911) have concluded three sufficient conditions for the existence and unique- 



ness of the minimizer of L(A): 1) u is positive, continuous and non-increasing. 2) Condition 
M: u(x)x is strictly increasing. 3) Condition Do: For any linear subspace L: \X n L|/JV < 
1 — (D — dim(L))/ lim a ._>. 00 xu(x). 

We show here that these well-known uniqueness and existence conditions of the common M- 
estimator are incompatible with exact recovery. 

Theorem 9 Assume that d, D G N, d < D, X is a data set in R D and L* G G(D,d) and let A be 
the minimizer of ( |29l >. If conditions M and Dq hold, then Im(A) 7^ L*. 

Tvler dl987h suggeste d to use the fu nction p(x) = D\og(x)/2 in d29l and the additional as- 
sumption that tr(A) = 1. IZhangl (|2012h recently showed that th is M-estimator satisfie s Im(A) = 
L*. However, it does not belong to the class of estimators of iKent and Tylerl (1199 ll) addressed 
by Theorem |9] First of all, condition M does not hold for u{x) = D/(2x). More importantly, 
one can note that the estimator (1291 with p(x) = Dlog(x)/2 (without the additional requirement 
tr(A) = 1) has multiple minimizers (indeed any scaling by a positive constant of a minimizer is 
also a minimizer). It is thus clear that Tyler's estimator (with the additional requirement tr( A) = 1) 
is different than the class of unique M-estimators that minimize 
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3.2 Interpretation of Q as Robust Inverse Covariance Estimator 

The total least squares subspace approximation is practically the minimization over L £ G(D, d) of 
the function 

N N 

^||x l -P L x J || 2 = ^||P L xx i || 2 . (31) 

i=l i=l 

Its solution is obtained by the span of the top d right vectors of the data matrix X (whose rows are the 
data points in X ), or equivalently, the top d eigenvectors of the covariance matrix X T X. The convex 
relaxation used in (l3Tb can be also applied to (DTI ) to obtain the following convex minimization 
problem: 

N 

Q 2 := argmin ||Qx;|| 2 . (32) 

The "relaxed" total least squares subspace is then obtained by the span of the bottom d eigenvectors 
of Q. 

We show here that Q 2 coincides with a scaled version of the empirical inverse covariance matrix. 
This clearly imply that the "relaxed" total least squared subspace coincides with the original one. 

Theorem 10 TfX and Q 2 are the data matrix and minimize r of (1321 ) and rank(X) = D (equiva- 
lently the data points span M D ), then 

Q 2 = (X T X)- 1 /tr((X T X)- 1 ). (33) 



Corollary 11 IfK and Q 2 are the data matrix and minimize r of (1321 ) and rank(X) = D, then their 
SVD's(Q 2 =\^\ 

way: For any 1 < i < D 



SVD's (Q 2 = Vq 2 ^q 2 V^T and X = Ux^xV^j are determined by each other in the following 



*i(Q 2 ) = c/^+i-i(X), where c = \ T ^ 2 +i-,( X ) ) (34) 

and for any n\ < n 2 such that cr„ 1 _i(X) > <7 ni (X) = <r„ 2 (X) > cr n2+ i(X) 

SP ({VQ 2(:D+1 _ t) }- ni ) = Sp ({Vx^j-J . (35) 

We view (@]) as a robust version of (l32l ). Since we verified robustness of the subspace recovered 
by (01) and also showed that (l32l) yields the inverse covariance matrix, we sometimes refer to the 
solution of (01) as a robust inverse covariance matrix (though we have only verified robustness to 
subspace recovery). This idea helps us inteipret our numerical procedure for minimizing ©, which 
we present in $4] 

4. IRLS Algorithms for Minimizing © 

4.1 Heuristic Proposal for Two IRLS Algorithms 

The procedure for minimizing (@} formally follows from the simple fact that the directional deriva- 
tive of F at Q in any direction Q — Q, where Q S H, is 0, i.e., 

^'(Q) . , Q - Q ) = for any QeH. (36) 
Q=Q / p 
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We remark that since H is an affine subspace of matrices, (1361 ) holds globally in H and not just 
locally around Q. 

We formally differentiate (@} at Q as follows (see more details in (|4"7T ). which appears later): 

; Q=Q ^ 2||Qx i || 

Throughout the formal derivation we ignore the possibility of zero denominator in 071 ), that is, we 
assume that Qxj ^ 0\/ 1 < i < N; we. later address this issue. 

Since F'(Q) is symmetric and Q — Q can be any symmetric matrix with trace 0, it is easy 
to note that (l36l ) implies that F'(Q) is a scalar matrix (e.g., multiply it by a basis of symmetric 
matrices with trace whose members have exactly 2 nonzero matrix elements). That is, 

£ Q^i+i^iQ = cI (38) 

j=i 2||Q X j|| 

for some cel. This implies that 

Indeed, we can easily verify that d39l ) solves (I38T). furthermore , (I38T ) is a Lyapunov equation whose 
solution is unique (see e.g., page 1 of Bhatia and Drissil (|2005|)). Since tr(Q) = 1, we obtain that 




Q 



which suggests the following iterative estimate of Q: 

/trl 





Formula (1401 ) is undefined whenever Q^Xj = for some k G N and 1 < i < N. In theory, we 
address it as follows. Let I(Q) = {1 < i < N : Qxj = 0}, L(Q) = Sp{xj}j 6 /(Q) and 





T(Q)=P U Q)A >. T^I P L(Q)x/tr P L(Q) x ^ ,^|P L(q) i|. (41) 

Using this notation, the iterative formula can be corrected as follows 

Q k+1 = r(Q fc ). (42) 

In practice, we can avoid data points satisfying ||QfcXj|| < 5 for a sufficiently small parameter 5 
(instead of ||QfcXj|| = 0). We follow a similar idea by replacing F with the regularized function F$ 
for a regularized parameter 5. In this case, (1421 obtains the following form: 

Qfc+1= (^maxdlQ^IU)) /tr max(||Q fc x 4 |U)) J" (43) 
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We note that the RHS of (@2]> is obtained as the limit of the RHS of (03]) when 8 approaches 0. 

The two iterative formulas, i.e., (I4"2l and (l43l . give rise to IRLS algorithms. For simplicity of 
notation, we exemplify this idea with the formal expression in (l40l ). It iteratively finds the solution 
to the following weighted (with weight l/||QfcX.;||) least squares problem: 



/v 



argminV 1 „ ||Qx^|| 2 . (44) 
Qee ~— ^ llWfcXill 

To show this, we note that (1441 is a quadratic function and any formal directional derivative at Qfc+i 
is 0. Indeed, 

N / N T \ / N 



d ^ 1 



T \ / T 
X„- \ / v a X ; X, 



d Q ~[ IIQfcXjH Q=Q fc+ i y-^ IIQfcXiH y \^ IIQpqH j 

for some c £ R, and /i, Q — Q/c+i^> = for any Qel Consequently, Qk+i of (l40l) is the 
minimizer of (I4"4"l) . 

Formula d43l (as well as (1421) ) provides another interpretation for Q as robust inverse covariance 
(in addition to the one discussed in §3-2b - Indeed, we note for example that the RHS of (1431 ) is the 
scaled inverse of a weighted covariance matrix; the scaling enforces the trace of the inverse to be 
1 and the weights of XjX^ are significantly larger when Xj is an inlier. In other words, the weights 
apply a shrinkage procedure for outliers. Indeed, since QfeXj approaches Qxj and the underlying 
subspace, which contain the inliers, is recovered by ker(Q), for an inlier x^ the coefficient of XjX^ 
approaches 1/5, which is a very large number (in practice we use 5 = 10~ 20 ). On the other hand, 
when Xj is sufficiently far from the underlying subspace, the coefficient of XjX?" is significantly 
smaller. 

4.2 Theory: Convergence Analysis of the IRLS Algorithms 

The following theorem analyzes the convergence of the sequence proposed by (l42l) to the minimizer 
of©. 

Theorem 12 Let X = {xj}^ be a data set in M. D satisfying (1 10b . Q the minimizer of (O, Qo 
an arbitrary symmetric matrix with tr(Qo) = 1 and {QijieN the sequence obtained by iteratively 
applying (l42l ) (while initializing itwith Qo), then {Qj}jgN converges to a matrix QgH. IfQ^i ^ Q 
for all 1 < i < N, then the sequence {QjjigN converges linearly to Q and Q = Q. 

The condition for the linear convergence to Q in Theorem [12] (i.e., Qxj ^ for all 1 < i < 
N) usually does not occur for noiseless data. This phenomenon is common in IRLS algorithms 
whose obje ctive functions ar e ^i-type and are not twice differentiable at 0. For example, Weiszfeld's 



Algori t hm (IWeiszfeldl.ll937h may not converge to the geometric median but to one of the data points 



dKuhnlll973l. §3 .4). On the other hand, regularized IRLS algorithms often converge linearly to the 



minimizer of the regularized function. We demonstrate this principle in our case as follows. 

Theorem 13 Let X = {xj}^ 1 be a data set in H D satisfying (1101 ). Qo an arbitrary symmetric 
matrix with tr(Qo) = 1 and {QjjieN the sequence obtained by iteratively applying (143] ) (while 
initializing it with Qo). Then the sequence {Qj}jgN converges linearly to the unique minimizer of 
Fs(Q). 
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T he linear convergence ra te of the iterative application of (l43l depends on 5. Following Theorem 
6. 1 of Ichan and Muletl dl999h . this rate is at most 



r(S) 



max 
^ A=A T ,tr(A)=0 



2^i=l,\\Q tXi \\>6 



(xf AQ.x,) 2 



maxdlQ.Xi^H) 



That is, HQ* - Q|| < C ■ r(5) k for some constant C > 0. If CEQ]> holds, then r(8) < 1 for all 5 > 
and r(<5) is a non-increasing function. Furthermore, if {xj £ A' : ||Qxj|| ^ 0} satisfies assumption 
(ITOb . then lim^o r(5) < 1. 



4.3 The Practical Choices for the IRLS Algorithm 

Following the theoretical discussion in §4.21 we prefer using the regularized version of the IRLS 
algorithm. We fix the regularization parameter to be smaller than the rounding error, i.e., 5 = 10 -20 
so that the regularization is very close to the original problem (even without regularization the 
iterative process is stable, but may have few warnings on badly scaled or close to singular matrices). 
The idea of the algorithm is to iteratively apply (l43l with an arbitrary initialization (symmetric with 
trace 1). We note that in theory {-F<s(Qfc)}ifceN is non-increasing (see e.g., the proof of Theorem [131 . 
However, empirically the sequence decreases when it is within the rounding error to the minimizer. 
Therefore, we check F$(Qk) every four iterations and stop our algorithm when we detect an increase 
(we noticed empirically that checking every four iterations, instead of every iteration, improves the 
accuracy of the algorithm). Algorithm [2] summarizes our practical procedure for minimizing Q. 

Algorithm 2 Practical and Regularized Minimization of (O 

Input: X = {xi, X2, • • • , xat} C MP: data 

Output: Q: a symmetric matrix in R DxD with tr(Q) = 1. 

Steps: 

. 5 = 1(T 20 

• Arbitrarily initialize Qo to be a symmetric matrix with tr(Q ) = 1 

• k = -l 
repeat 

• k=k+l 

£t=l maxdlQfcX, !!^) J / tr U Ei=l maxdlQ^I^) J 

until F(Q fc+ i) > F(Q fc _ 3 ) and mod (k + 1, 4) = 

• Output Q := Qfc 



4.4 Complexity of Algorithm [2] 

Each update of Algorithm |2]requires a complexity of order 0(N • D 2 ), due to the sum of N D x D 
matrices. Therefore, for n s iterations the total running time of Algorithm[2]is of order 0(n s -N-D 2 ). 
In most of our numerical experiments n s was less than 40. The storage of this algorithm is 0(N x 
D), which amounts to storing X. Thus, Algorithm |2] has the same order of storage and complexity 
as PCA. In practice, it might be a bit slower due to a larger constant for the actual complexity. 
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5. Subspace Recovery in Practice 

We view the GMS algorithm as a prototype for various subspace recovery algorithms. We discuss 
here modifications and extensions of this procedure in order to make it even more practical. 

5.1 Subspace Recovery without Knowledge of d 

In theory, the subspace recovery described here can work without knowing the dimension d. In the 
noiseless case, one may use © to estimate the subspace as guaranteed by Theorem [T] In the case of 
small noise one can estimate d from the eigenvalues of Q and then apply the GMS algorithm. This 
strategy is theoretically justified by Theorems Q] and [6] as well as the discussion following (|25T ). The 
problem is that condition © for guaranteeing exact recovery by GMS is restrictive; in particular, 
it requires the number of outliers to be larger than at least D — d (according to our numerical 
experiments it is safe to use the lower bound 1.5 (D — d)). 

We describe here several solutions for dealing with smaller number of outliers and later demon- 
strate them numerically in £16.21 for artificial data and in £16.71 and £16.81 for real data. Nevertheless, 
we remark that the general problem of subspace recov ery without knowing the dimension is rather 
difficu lt Other a lgorithms for this problem, such as IChandrasekaran et al. d2009h : ICandes et all 



(1201 ll) : Ku et all (1201 Obi) : iMcCoy and Troppl (120111) . require estimates of unknown regularization 



parameters (which often depend on various properties of the data, in particular, the unknown in- 
trinsic dimension) or strong assumptions on the underlying distribution of the outliers or corrupted 
elements. We also note that if only conditions © and © hold, then Theorem Q] still guarantees that 
the GMS algorithm outputs a subspace containing the underlying subspace. Using some informa- 
tion on the data one may recover the underlying subspace from the outputted subspace containing 
it. 

Our first practical solution to deal with the restriction on the number of outliers is to reduce the 
ambient dimension of the data. When the reduction is not too aggressive, it can be performed via 
PCA. In £J5.3I we also propose a robust dimensionality reduction which can be used instead. There 
are two problems with this strategy. First of all, the reduced dimension is another parameter that 
requires tuning. Second of all, some information may be lost by the dimensionality reduction and 
thus exact recovery of the underlying subspace is generally impossible. 

A second practical solution is to add artificial outliers. The number of added outliers should not 
be too large (otherwise © and © will be violated), but they should sufficiently permeate through 
M D so that © holds. In practice, the number of outliers can be 2D, since empirically © holds 
with high probability when Nq = 2(D — d). To overcome the possible impact of an outliers with 
arbitrarily large mag nitude, we project the data with artificial outliers onto the sphere (following 



Lerman et al.l (j2012|)). Furthermore, if the original data matrix does not have full rank (in particular 
if N < D) we reduce the dimension of the data (by PCA) to be the rank of the data matrix. This 
dimensionality reduction clearly does not result in any loss of information. We refer to the whole 
process of initial "lossless dimensionality reduction" (if necessary), adding 2D artificial Gaussian 
outliers, normalization onto the sphere and application of GMS (with optional estimation of d by 
the eigenvalues of Q) as the GMS 2 algorithm. We believe that it is the best practical solution to 
avoid condition © when d is unknown. 
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A third solution is to regularize our M estimator, that is, to minimize the following objective 
function with the regularization parameter A: 



N 



Q= argmin S~] ||Qxj|| + A||Q||^. 
tr(Q)=l,Q=Q T i=1 



The IRLS algorithm then becomes 



Q fc - 



< N 

£ 



XiX 



T 



^ max(\\Q k Xi\\,5) 



N 



+ 2AI /tr j; 



XiX„- 



(45) 



^ max(||Q fc Xi||,(5) 



+ 2AI . 



We note that if A = and there are only few outliers, then in the noiseless case dim(ker(Q)) > d 
and in the small noise case the number of significantly small eigenvalues is bigger than d. On the 
other hand when A — >■ oo, Q — >■ I/D, whose kernel is degenerate (similarly, it has no significantly 
small eigenvalues). Therefore, there exists an appropriate A for which dim(ker(Q)) (or the number 
of significantly small eigenvalues of Q) is d. This formulation transforms the estimation of d into 
estimatio n of A. This strategy is in line with other common regularized solut i ons to this problem 
(see e .g., Ichandrasekaran et all d2009h : ICandes et all (boill) : Ixu et all fcoiObl) : ImcCov and Tropp 
(|201 lh ). however, we find it undesirable to estimate a regularization parameter that is hard to inter- 
pret in terms of the data. 



5.2 Subspace Recovery with Knowledge of d 

Knowledge of the intrinsic dimension d can help improve the performance of GMS or suggest 
completely new variants (especially as GMS always finds a subspace containing the underlying 
subspace). For example, knowledge of d can be used to carefully estimate the parameter A of (|45T ). 
e.g. , by finding A yieldin g exactly a ci-dimensional subspace via a bisection procedure. 



Lerman et al.l (120121) modified the strategy described in here by requiring an additional con- 



straint on the maximal eigenvalue of Q in (l29l ): A max (Q) < -jj^ (where A max (Q) is the largest 
eigenvalue of Q). This approach has theoretical guarantees, but it comes with the price of additional 
SVD decomposition in each iteration, which makes the algorithm slightly more expensive. Besides, 
in practice (i.e., noisy setting) this approach requires tuning of the upper bound on A max (Q). In- 
deed, the solution Q' to their minimization problem (with A max (Q') < 1/(D — d) and tr(Q') = 1) 
satisfies that dim(ker(Q') is at most d and equals d when Q' is a scaled projector operator. They 
proved that dim(ker(Q') = d for the setting of pure inliers (lying exactly on a subspace) under 
some conditions avoiding the three types of enemies. However, in practice (especially in noisy 
cases) the actual subspace often has dimension smaller than d and thus the bound on A max (Q) has 
to be tuned as an additional parameter. In some cases, one may take A max (Q) > -^r^ and find the 
subspace according to the bottom d eigenvectors. In other cases, a bisection method on the bound 
of A max (Q) provide more accurate results (see related discussion in Lerman et al. ( 2012 . §6.1.6)). 



5.3 The EGMS Algorithm 

We formulate in Algorithm [3] the Extended Geometric Median Subspace (EGMS) algorithm for 
subspace recovery with known intrinsic dimension. 

We justify this basic procedure in the noiseless case without requiring (© as follows. 
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Algorithm 3 The Extended Geometric Median Subspace Algorithm 

Input: X = {xj}^ 1 C IR D : data, d: dimension of L*, an algorithm for minimizing (01) 

Output: L: a d-dimensional linear subspace in W D . 

Steps: 

• L = R D 

repeat 

• Q = argmm QgHQp . x= QF(Q) 

• u = the top eigenvector of Q 

• L = L n Sp(u- L ) 
until dim(L) = d 



Theorem 14 Assume that d, D G N, d < D, X is a data set in R D and L* G G(D,d). If only 
conditions (© and © hold, then the EGMS Algorithm exactly recovers L*. 

In £|6.4l we show how the vectors obtained by EGMS at each iteration can be used to form robust 
principal components (in reverse order), even when Q is degenerate. 

5.3.1 Computational Complexity of GMS and EGMS 

The computational complexity of GMS is of the same order as that of Algorithm[2l i.e., 0(n s ■ N ■ 
D 2 ) (where n s is the number of required iterations for Algorithm Indeed, after obtaining Q, 
computing L* by its smallest d eigenvectors takes an order of 0(d ■ D 2 ) operations. 

EGMS on the other hand repeats Algorithmic^ — d times; therefore it adds an order of 0((D — 
d) • n s ■ N ■ D 2 ) operations, where n s denotes the total number of iterations for Algorithm |2] 
In implementation, we can speed up the EGMS algorithm by excluding the span of some of the 
top eigenvectors of Q from L (instead of excluding only the top eigenvector in the third step of 
Algorithmic. We demonstrate this modified procedure on artificial setting in §6.21 



6. Numerical Experiments 
6.1 Model for Synthetic Data 

In ^6.31^6.6l we generate data from the following model. We randomly choose L* G G(D, d), sam- 
ple N\ inliers from the d-dimensional Multivariate Normal distribution N(0, Idxd) on L* and add 
A^o outliers sampled from a uniform distribution on [0, 1] D . The outliers are strongly asymmetric 



aroun d the subspace to make the subspace recovery problem more difficult (|Lerman and Zhangl. 



2010I) . In some experiments below additional Gaussian noise is considered. When referring to this 
synthetic data we only need to specify its parameters N\, Nq, D, d and possibly the standard devi- 
ation for the additive noise. For any subspace recovery algorithm (or heuristics), we denote by L its 
output (i.e., estimator for L*) and measure the corresponding recovery error by = ||Pf — Pl* 



6.2 Demonstration of Practical Solutions of 15. H and ^5.21 

We present two different artificial cases, where in one of them condition © holds and in the other 
one it does not hold and test the practical solutions of ^5.1l and ^5.2l in the second case. 
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The two cases are the following instances of the synthetic model of £16.11 (a) (N\,No,D,d) 
= (100 , 100, 100, 20) and (b) (JVi, N , D, d) = (100, 20, 100, 20). The GMS algorithm estimates 
the underlying subspace L* given d = 20 with recovery errors 2.1 x 10~ 10 and 3.4 in cases (a) 
and (b) respectively. In case (a) there are sufficiently many outliers (with respect to D — d) and the 
GMS algorithm is successful. We later show in $63] that the underlying dimension (d = 20) can be 
easily estimated by the eigenvalues of Q. In case (b) iVo = 0.25 * (D — d), therefore, condition (O 
is violated and the GMS algorithm completely fails. 

We demonstrate the success of the practical solutions of ^5. II and §5.2l in case (b). We assume 
that the dimension d is known, though in £|6.3l we estimate d correctly for the non-regularized solu- 
tions of £15.11 Therefore, these solutions can be also applied without knowing the dimension. If we 
reduce the dimension of the data set in case (b) from D = 100 to D = 35 (via PCA; though one can 
also use EGMS), then GMS (with d = 20) achieves a recovery error of 0.23, which indicates that 
GMS almost recovers the subspace correctly. We remark though that if we reduce the dimension 
to e.g., D = 55, then the GMS algorithm will still fail. We also note that the recovery error is not 
as attractive as the ones below; this observation probably indicates that some information was lost 
during the dimension reduction. 

The GMS2 algorithm with d = 20 recovers the underlying subspace in case (b) with error 1.2 x 
10~ 10 . This is the method we advocated for when possibly not knowing the intrinsic dimension. 

The regularized minimization of (|45T ) with A = 100 works well for case (b). In fact, it recovers 
the subspace as ker Q (without using its underlying dimension) with error 3.3 x 10~ 13 . The only 
issue is how to determine the value of A. We claimed in £]5.2I that if d is known, then A can be 
carefully estimated by the bisection method. This is true for this example, in fact, we initially chose 
A this way. 

We remark that the REAPER algorithm of iLerman et al.l (|2012l) did not perform well for this 
particular data, though in general it is a very successful solution. The recovery error of the direct 
REAPER algorithm was 3.725 (and 3.394 for S-REAPER) and the error for its modified version via 
bisection (relaxing the bound on the largest eigenvalue so that dim(ker(Q)) = 20) was 3.734 (and 
3.175 for S-REAPER). 

At last we demonstrate the performance of EGMS and its faster heuristic with d = 20. The 
recovery error of the original EGMS for case (b) is only 0.095. We suggested in £]5.3.1l a faster 
heuristic for EGMS, which can be reformulated as follows: In the third step of Algorithm [3) we 
replace u (the top eigenvector of Q) with U, the subspace spanned by several top eigenvectors. In 
the noiseless case, we could let U be the span of the nonzero eigenvectors of Q. This modification 
of EGMS (for the noiseless case) required only two repetitions of Algorithm [2] and its recovery error 
was 2.2 x 10~ 13 . In real data sets with noise we need to determine the number of top eigenvectors 
spanning U, which makes this modification of EGMS less automatic. 



6.3 Demonstration of Dimension Estimation 

We test dimension estimation by eigenvalues of Q for cases (a) and (b) of £]6.2I The eigenvalues 
of Q obtained by Algorithm [2] for the two cases are shown in Figure [2 In case (a), the largest 
logarithmic eigengap (i.e., the largest gap in logarithms of eigenvalues) occurs at 80, so we can 
correctly estimate that d = D — 80 = 20 (the eigenvalues are not zero since Algorithm [2] uses the 
(^-regularized objective function). However, in case (b) the largest eigengap occurs at 60 and thus 
mistakenly predicts d = 40. 
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Figure 2: Dimension estimation: In the left figure, the starred points and the dotted point represent 
log-scaled eigenvalues of the output of Algorithm\2\for cases (a) and (b) respectively (see 
Q6.3\l . The righ t figure corresponds to case (b) with dimension reduced to 35. 



As we discussed in Q6.2[ the dimension estimation fails here since condition © is not satisfied. 
However, we have verified that if we try any of the solutions proposed in £ j5.1l fhen we can correctly 
recover that d = 20 by the logarithmic eigengap. For example, in Figure [2] we demonstrate the 
logarithms of eigenvalues of Q in case (b) after dimensionality reduction (via PCA) onto dimension 
D = 35 and it is clear that the largest gap is at d = 20 (or D — d = 80). We obtained similar graphs 
when using 2D artificial outliers (more precisely, the GMS2 algorithm without the final application 
of the GMS algorithm) or the regularization of (|45T ) with A = 100. 

6.4 Information obtained from Eigenvectors 

Throughout the paper we emphasized the subspace recovery problem, but did not discuss at all the 
information that can be inferred from the eigenvectors of our robust PCA strategy. Since in standard 
PCA these vectors have significant importance, we exemplify the information obtained from our 
robust PCA and compare it to that obtained from PCA and some other robust PCA algorithms. 

We create a sample from a mixture of two Gaussian distributions with the same mean and 
same eigenvalues of the covariance matrices, but different eigenvectors of the covariance matrices. 
The mixture percentages are 25% and 75%. We expect the eigenvectors of any good robust PCA 
algorithm (robust to outliers as perceived in this paper) to be close to that of the covariance of the 
main component (with 75%). 

More precisely, we sample 300 points from N(0, Si), where Si is a 10 x 10 diagonal matrix 
with elements 1, 2 _1 , 2" 2 , • • • , 2" 9 and 100 points from N(0, S 2 ), where S 2 = USiU T , where 
U is randomly chosen from the set of all orthogonal matrices in M 10xl °. The goal is to estimate 
the eigenvectors of Si (i.e., the standard basis vectors in M 10 ) in the presence of 25% "outliers". 
Unlike the subspace recovery problem, where we can expect to exactly recover a linear structure 
among many outliers, here the covariance structure is more complex and we cannot exactly recover 
it with 25% outliers. 

We estimated the eigenvectors of Si by the the eigenvectors of Q of Algorithm [2] in reverse 
order (recall that Q is a scaled and robust version of the inverse covariance). We refer to this 
procedure as "EVs (eigenvectors) of Q 1 ". We also estimated these eigenvectors by standard 
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PCA, LLD dMcCov and Troppl . 1201 lh with A 
A 



0.8y/D/N and PCP dCandes et all 1201 fl) with 
1 / y / max(D, N). We repeated the random simulation (with different samples for the random 
orthogonal matrix U) 100 times and reported in Table |2] the average angles between the estimated 
and actual top two eigenvectors of Si according to the different methods. We note that the "EVs of 
Q _1 " outperforms PCA, LLD (or OP) and PCP in terms of estimation of the top two eigenvectors 
of Si. We remark though that PCP does not suit for robust estimation of the empirical covariance 
and thus the comparison is unfair for PCP. 



Table 2: Angles (in degrees) between the estimated and actual top two eigenvectors of Si. 





EVs of cr 1 


LLD 


PCP 


PCA 


Eigenvector 1 
Eigenvector 2 


3.0° 
3.0° 


5.5° 
5.5° 


45.7° 
47.4° 


14.8° 
40.3° 



When the covariance matrix Si (and consequently also S2) is degenerate, Q might be singular 
and therefore Q cannot be directly used to robustly estimate eigenvectors of the covariance matrix. 
For this case, EGMS (Algorithm can be used, where the vector u obtained in the ith iteration 
of Algorithm [3] can be considered as the (D — i + l)st robust eigenvector (that is, we reverse the 
order again). To test the performance of this method, we modify Si in the above model as follows: 
Si=diag(l, 0.5, 0.25, 0, 0, • • • ,0). We repeated the random simulations of this modified model 
100 times and reported in Table |2] the average angles between the estimated and actual top two 
eigenvectors of Si according to the different methods. Here LLD did slightly better than EGMS 
and they both outperformed PCA (and PCP). 



Table 3: Angles (in degrees) between the estimated and actual top two eigenvectors of Si. 





EGMS 


LLD 


PCP 


PCA 


Eigenvector 1 
Eigenvector 2 


5.2° 
5.2° 


3.4° 
3.4° 


42.6° 
47.3° 


8.2° 
16.1° 



6.5 The Effect of the Regularization Parameter 5 



We assume a synthetic data set sampled according to the model of £16.11 with (N\, Nq, D,d) = 
(250,250,100,10). We use the GMS algorithm with d = 10 and different values of the reg- 
ularization parameter 5 and record the recovery error in Figure [3] For 10 -14 < 5 < 10~ 2 , 
log(error) — log(5) is constant. We thus empirically obtain that the error is of order 0(5) in this 
range. On the other hand. (l28l only obtaine d an order of 0(VS). It is possible that methods sim- 
ilar to those of ICoudron and Lermanl (120121) can obtain sharper error bounds. We also expect that 
for 5 sufficiently small (here smaller than 10 -14 ), the rounding error becomes dominant. On the 
other hand, perturbation results are often not valid for sufficiently large 5 (here this is the case for 



5 > 10" 2 



6.6 Detailed Comparison with other algorithms for Synthetic Data 



Using the synthetic data of Q6.l\ we compare d the GMS algorithm with the following algorithms: 
MDR (Mean Absolute Deviation Rounding) of McCoy and Troppl ( 201 1 ). LLD (Low -Leverage De- 
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logioO 5 ) 



Figure 3: The recovery errors and the regularization parameters 5 



composition) of lMcCov and Troppl (hoilh. OP (Outlier Pursuit) of IXu et all (boiObl) . PCP (Princi 
pal Co mponent Pursuit) of ICandes et all (|201 ll ). MKF ( Median if -flats w ith K = 1) of lzhang et al 



(200 91), HR-PCA (High-dim ensional Robust PCA) o f Ku et all ( 2010a!) . a common M-estimator 
(|Huber and Ronchetti|]2009l . see e.g.,) and i?i-PCA of bing et all (|2006l ). The codes of OP and HR- 
PCA were obtained from http://guppy.mpe.nus.edu.sg/~mpexuh , the code of MKF from http://www.math.umn.edu/~zh; 
, the code of PCP from http://perception.csl.illinois.edu/matrix-rank/sample_code.html . with the 
Accelerated Proximal Gradient and full SVD version, the codes of MDR and LLD from http://www.acm.caltech.edu/~mc 
and the codes of the common M-estimator, i?i-PCA and GMS will appear in a supplemental web- 
page. We also record the output of standard PCA, where we recover the subspace by the span of the 
top d eigenvectors. We ran the experiments on a computer with Intel Core 2 CPU at 2.66GHz and 2 
GB memory. 

We remark that since the basic GMS algorithm already performed very well on these artificial 
instances, we did not test its extensions and modifications described in $_D(e.g., GMS2 and EGMS). 

For all experiments, we could correctly estimate d by the largest logarithmic eigengap of the 
output of Algorithm [2] Nevertheless, we used the knowledge of d for all algorithms for the sake 
of fair comparison. In particular, for LLD, OP and PCP we estimated L* by the span of the top d 
eigenvectors of the low-rank matrix. Similarly, for the common M-estimator we used the span of 
the top d eigenvectors of the estimated covariance A. For the HR-PCA algorithm we also used the 
true percentage of outliers (50% in our experiments). For LLD, OP and PCP we set the mixture 
parameter A as 0.8^/D/N, O.&^D/N, l/ v / max(D, N) respectively. 

For the common M-estimator, we used u{x) = max(ln(x)/x, 10 30 ) in (|30T >. Considering the 
conditions on convergence of (l30l in ^3.11 we also tried other functions: u{x) = max(cc -0 ' 5 , 10 30 ) 
had a significantly larger recovery error and u{x) = max(x" -°- 9 ,10 30 ) resulted in a similar recovery 
error as max(ln(x)/x, 10 30 ) but a double running time. 

We used the syntectic data with different values of (N±, No,D,d). In some instances we also 
add noise from the Gaussian distribution N(0, ry 2 I) with i] = 0.1 or 0.01. We repeated each experi- 
ment 20 times (due to the random generation of data). We record in Table @]the mean running time, 
the mean recovery error and their standard deviations. 

We remark that PCP is designed for uniformly corrupted coordinates of data, instead of cor- 
rupted data points (i.e., outliers), therefore, the comparison with PCP is somewhat unfair for this 
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kind of data. On the other hand, the applications in Q6.7\ and £16.81 are tailored for the PCP model 
(though the other algorithms still apply successfully to them). 

From Table [4] we can see that GMS is the fastest robust algorithm. Indeed, its running time is 
comparable to that of PCA. We note that this is due to its linear convergence rate (usually it con- 
verges in less than 40 iterations). The common M-estimator is the closest algorithm in terms of 
running time to GMS, since it also has the linear convergence rate. In contrast, PCP, OP and LLD 
need a longer running time since their convergence rates are much slower. Overall, GMS performs 
best in terms of exact recovery. The PCP, OP and LLD algorithms cannot approach exact recovery 
even by tuning the parameter A. For example, in the case where (iVi, No, D, d) = (125, 125, 10, 5) 
with i] = 0, we checked a geometric sequence of 101 A values from 0.01 to 1, and the smallest 
recovery errors for LLD, OP and PCP are 0.17, 0.16 and 0.22 respectively. The common M- 
estimator performed very well for many cases (sometimes slightly better than GMS), but its per- 
formance deteriorates as the density of outliers increases (e.g., poor performance for the case where 
(iVi, A^o, D, d) = (125, 125, 10, 5)). Indeed, Theorem|9]indicates problems with the exact recovery 
of the common M-estimator. 

At last, we note that the empirical recovery error of the GMS algorithm for noisy data sets is in 
the order of ^/rj, where rj is the size of noise. 



6.7 Yale Face data 



Following ICandes et al.1 (|201ll ). we apply our algorithm to face images. It has been shown that 
face images from the sam e person lie in a low-dimensional linear subspace of dimension at most 9 
( Basri and Jacobs!. 2003 ). However, cast shadows, specular reflections and saturations could possi- 
bly distort this low-rank modeling. Therefore, one can use a good robust PCA algorithm to remove 
these errors if one has many images from the same face. 



We used the images of the first two persons in the extended Yale face database B (|Lee et al 



2005), where each of them has 65 images of size 192 x 168 under different illumi nation conditions. 
There fore we represent each person by 65 vectors of length 32256. Following ( Basri and Jacobs!. 
20031) we applied GMS, GMS2 and EGMS with d = 9 and we also reduced the 65 x 32256 matrix 
to 65 x 65 (in fact, we only reduced the representation of the column space) by rejecting left vectors 
with zero singular values. We also applied the GMS algorithm after initial dimensionality reduction 
(via PCA) to D = 20. The running times of EGMS and GMS (without dimensionality reduction) 
are 13 and 0.16 seconds respectively on average for each face (we used the same computer as in 
§6-6b - On the other hand, the running times of PCP and LLD are 193 and 2.7 seconds respectively. 
Moreover, OP ran out of memory. The recovered images are shown in Figure HJ where the shadow of 
the nose and the parallel lines were removed best by EGMS. The GMS algorithm without dimension 
reduction did not perform well, due to the difficulty explained in $5] and demonstrated in £16.21 The 
GMS2 algorithm turns out to work well, except for the second image of face 2. However, other 
algorithms such as PCP and GMS with dimension reduction (D = 20) performed even worse on 
this image and LLD did not remove any shadow at all; the only good algorithm for this image is 
EGMS. 



6.8 Video Surveillance 



For background subtraction in surve illance videos (|Li et all 120041) . we consider the following two 
videos used by (ICandes et all l201ll) : "Lobby in an office building with switching on / off lights" 
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Table 4: Mean running times, recovery errors and their standard deviations for synthetic data. 



(Ni,No.D,d) 




CMS 


MDR 


LLD 


OP 


PCP 


HR-PCA 


MKF 


PCA 


M-est. 


.Ri-PCA 




g 


6e-ll 


0.275 


1.277 


0.880 


0.605 


0.210 


0.054 


0.193 


0.102 


0.121 


(125,125, 10,5) 


std.e 


4e-ll 


0.052 


0.344 


0.561 


0.106 


0.049 


0.030 


0.050 


0.037 


0.048 


77 = 


t(s) 


0.008 


0.371 


0.052 


0.300 


0.056 


0.378 


0.514 


0.001 


0.035 


0.020 




std.t 


0.002 


0.120 


0.005 


0.054 


0.002 


0.001 


0.262 


8e-06 


4e-04 


0.014 




g 


0.011 


0.292 


1.260 


1.061 


0.567 


0.233 


0.069 


0.213 


0.115 


0.139 


(125,125, 10,5) 


std.e 


0.004 


0.063 


0.316 


0.491 


0.127 


0.075 


0.036 


0.073 


0.054 


0.073 


Tj = 0.01 


Us) 


0.008 


0.340 


0.053 


0.287 


0.056 


0.380 


0.722 


0.001 


0.035 


0.052 




std.t 


0.001 


0.075 


0.007 


0.033 


0.001 


0.009 


0.364 


le-05 


4e-04 


0.069 




g 


0.076 


0.264 


1.352 


0.719 


0.549 


0.200 


0.099 


0.185 


0.122 


0.128 


(125,125, 10,5) 


std.e 


0.023 


0.035 


0.161 


0.522 


0.102 


0.051 


0.033 


0.048 


0.041 


0.050 


77 = 0.1 


t(s) 


0.007 


0.332 


0.055 


0.301 


0.056 


0.378 


0.614 


0.001 


0.035 


0.032 




std.t 


0.001 


0.083 


0.004 


0.044 


0.001 


0.001 


0.349 


7e-06 


4e-04 


0.037 




g 


2e-ll 


0.652 


0.258 


0.256 


0.261 


0.350 


0.175 


0.350 


le-12 


0.307 


(125,125,50,5) 


std.e 


3e-ll 


0.042 


0.030 


0.032 


0.033 


0.023 


0.028 


0.025 


5e-12 


0.029 


77 = 


t(s) 


0.015 


0.420 


0.780 


1.180 


3.164 


0.503 


0.719 


0.006 


0.204 


0.020 




std.t 


0.001 


0.128 


0.978 


0.047 


0.008 


0.055 


0.356 


9e-05 


0.001 


0.011 




g 


0.061 


0.655 


0.274 


0.271 


0.273 


0.355 


0.196 


0.359 


0.007 


0.321 


(125,125,50, 5) 


std.e 


0.009 


0.027 


0.039 


0.038 


0.040 


0.038 


0.038 


0.033 


0.001 


0.038 


77 = 0.01 


t(s) 


0.023 


0.401 


4.155 


1.506 


0.499 


0.653 


0.656 


0.006 


0.191 


0.028 




std.t 


0.002 


0.079 


0.065 


0.197 


0.006 


0.044 


0.377 


8e-05 


0.001 


0.022 




g 


0.252 


0.658 


0.292 


0.290 


0.296 


0.358 


0.264 


0.363 


0.106 


0.326 


(125,125, 50,5) 


std.e 


0.027 


0.033 


0.032 


0.032 


0.033 


0.027 


0.031 


0.032 


0.014 


0.032 


77 = 0.1 


t(s) 


0.021 


0.363 


0.923 


1.726 


0.501 


0.638 


0.641 


0.006 


0.191 


0.025 




std.t 


0.001 


0.063 


0.033 


0.470 


0.009 


0.051 


0.240 


le-04 


0.001 


0.012 




g 


3e-12 


0.880 


0.214 


0.214 


0.215 


0.332 


0.161 


0.330 


2e-12 


0.259 


(250, 250, 100, 10) 


std.e 


2e-12 


0.018 


0.019 


0.019 


0.019 


0.014 


0.024 


0.012 


9e-12 


0.016 


77 = 


t(s) 


0.062 


1.902 


3.143 


7.740 


2.882 


1.780 


1.509 


0.039 


0.819 


1.344 




std.t 


0.006 


0.354 


4.300 


0.038 


0.014 


0.041 


1.041 


3e-04 


0.023 


0.708 




g 


0.077 


0.885 


0.217 


0.216 


0.219 


0.334 


0.164 


0.335 


0.009 


0.263 


(250,250, 100, 10) 


std.e 


0.006 


0.031 


0.019 


0.018 


0.020 


0.019 


0.019 


0.017 


3e-04 


0.018 


77 = 0.01 


t(s) 


0.084 


1.907 


21.768 


11.319 


2.923 


1.785 


1.412 


0.039 


0.400 


1.086 




std.t 


0.010 


0.266 


0.261 


0.291 


0.014 


0.041 


0.988 


3e-04 


0.002 


0.738 




g 


0.225 


0.888 


0.238 


0.237 


0.262 


0.342 


0.231 


0.345 


0.136 


0.276 


(250,250,100, 10) 


std.e 


0.016 


0.020 


0.019 


0.019 


0.019 


0.019 


0.018 


0.015 


0.010 


0.019 


T] = 0.1 


t(s) 


0.076 


1.917 


4.430 


16.649 


2.876 


1.781 


1.555 


0.039 


0.413 


1.135 




std.t 


0.007 


0.299 


0.069 


1.184 


0.014 


0.025 


0.756 


4e-04 


0.011 


0.817 




e 


4e-ll 


1.246 


0.162 


0.164 


0.167 


0.381 


0.136 


0.381 


3e-13 


0.239 


(500, 500, 200, 20) 


std.e 


le-10 


0.018 


0.011 


0.011 


0.011 


0.010 


0.009 


0.008 


6e-14 


0.009 


77 = 


t(s) 


0.464 


23.332 


16.778 


89.090 


16.604 


8.602 


5.557 


0.347 


6.517 


15.300 




std.t 


0.024 


2.991 


0.878 


1.836 


0.100 


0.216 


4.810 


0.009 


0.126 


3.509 




e 


0.082 


1.247 


0.160 


0.162 


0.166 


0.374 


0.139 


0.378 


0.012 


0.236 


(500, 500, 200, 20) 


std.e 


0.003 


0.018 


0.007 


0.007 


0.008 


0.011 


0.010 


0.006 


2e-04 


0.007 


77 = 0.01 


t(s) 


0.592 


23.214 


128.51 


122.61 


16.823 


8.541 


6.134 


0.354 


2.361 


15.165 




std.t 


0.060 


3.679 


1.155 


6.500 


0.036 


0.219 


4.318 


0.019 


0.064 


3.485 




e 


0.203 


1.262 


0.204 


0.204 


0.250 


0.391 


0.275 


0.398 


0.166 


0.270 


(500, 500, 200, 20) 


std.e 


0.007 


0.012 


0.007 


0.007 


0.007 


0.012 


0.272 


0.009 


0.005 


0.008 


77 = 0.1 


t(s) 


0.563 


24.112 


24.312 


202.22 


16.473 


8.552 


8.745 


0.348 


2.192 


15.150 




std.t 


0.061 


2.362 


0.226 


8.362 


0.050 


0.155 


3.408 


0.010 


0.064 


3.420 
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(a) 



(b) 



(c) 



(d) 



(e) 



(g) 



Figure 4: Recovering faces: (a) given images, (b)-(f) the recovered images by EGMS, GMS without 
dimension reduction, GMS2, GMS with dimension reduced to 20, PCP and LLD respec- 
tively 
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and "Shopping center" from http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html . In the 



first video, the resolution is 160 x 128 and we used 1546 frames from 'SwitchLightlOOO.bmp' to 
'SwitchLight2545.bmp'. In the second video, the resolution is 320 x 256 and we use 1000 frames 
from 'ShoppingMalll001.bmp' to 'ShoppingMall2000.bmp'. Therefore, the data matrices are of 
size 1546 x 20480 and 1001 x 81920. We used a computer with Intel Core 2 Quad Q6600 2.4GHz 
and 8 GB memory due to the large size of these data. 

We applied GMS, GMS2 and EGMS with d = 3 and with initial dimensionality reduction to 
200 to reduce running time. We obtain the foreground by the orthogonal projection to the recovered 
3-dimensional subspace. Figure[5]demonstrates foregrounds detected by EGMS, GMS, GMS2, PCP 
and LLD, where PCP and LLD used A = l/^max(D,N), 0.8^/D/N. We remark that OP ran out 
of memory. Using truth labels provided in the data, we also form ROC curves for GMS, GMS 2, 
EGMS and PCP in Figure [6] (LLD is not included since it performed poorly for any value of A we 
tried). We note that PCP performs better than both GMS and EGMS in the 'Shoppingmall' video, 
whereas the latter algorithms perform better than PCP in the 'SwitchLight' video. Furthermore, 
GMS is significantly faster than EGMS and PCP. Indeed, the running times (on average) of GMS, 
EGMS and PCP are 91.2, 1018.8 and 1209.4 seconds respectively. 

7. Proofs of Theorems 
7.1 Proof of Theorem [J 

We will prove that any minimizer of ((U) is also a solution to the oracle problem ©, by showing that 
Qo m © satisfies that 

F(Q + A) - F(Qo) > for any symmetric A with tr(A) = and AP L * ^ 0. (46) 
First, we differentiate ||Qx|| at Q = Qo when x G ker(Qo) 1 " as follows: 



d 



|Qx| 



dQ Q=Qo dQ 



d Qxx r Q T 



Qoxx J + xx J Q 







Q=Qo 2||Q x| 



Q=Qo dQ 2||Q x 
We note that for any x G R D \ {0} satisfying Qox / and A G R DxD symmetric 

||(Qo + A)x|| - HQoxll > (A, (Q xx t + xx T Q )/2||Q x||\ = (a, Q xx t /||Q x 



(47) 



F \ IF 

(48) 

Indeed the first equality follows from (l47l) and the convexity of ||Qx|| in Q and the second equality 
follows from the symmetry of A and Qo as well as the definition of the Frobenius dot product If on 
the other hand Qox = 0, then clearly 

||(Qo + A)x|| - HQoxll = ||Ax||. (49) 



For simplicity of our presentation, we use d49l ) only for x G X\ (where obviously Qox = 
since QoPl* = 0). On the other hand, we use (|48"1) for all x G Xq. One can easily check that if 
x G Xq and Qox = 0, then replacing (l48l with d49b does not change the analysis below. Using 
these observations we note that 

F(Qo + A)-F(Q ) > Yl H Ax ll+ E (a,Q xx t /||Q x||)^. (50) 
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Figure 5: Video surveillance: (a) the given frames (b)-(e) the detected foreground by EGMS, GMS, 
GMS2, PCP, LLD respectively 




Figure 6: ROC curves for EGMS, GMS, GMS2 and PCP the in ' SwitchLight' video (the left figure) 
and the ' ShoppingmalV video (the right figure) 
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We assume first that AP L . = 0. In this case, Q + A £ i and (Qo + A)P L . = 0. Since Qo 
is the minimizer of ©, we obtain the following identity (which is analogous to (l36l)): 

y-/ A Qoxx^\ > o V A G M. DxD s.t. tr(A) = , APl* = 0. (51) 

xta iiQoxii/ F - 



Now we will prove (I46I ). Using (1501 ) and the facts that X\ C L* and Qo = P L ,i Qo (since 
Pl* Qo = QoPl* = Q), we establish the following inequality: 

F(Qo + A)-F(Q ) > H Ax ll+ E (a,Q xx t /||Q x|| 

xeAx xSAb 

= J2 l|AP L *x||+ ((AP l , +AP l ,x),P l ^Qoxx t /||Q x|| 

xeAi xeAb 

> ]T (||Pl*AP l .x|| + ||P l »xAP l »x||)/^/2 

xe^i 

+ ((Pl-^Pl- + P L *x AP L ,x), Qoxx^/IIQoxIl)^ . (52) 

For ease of notation we denote Ao = tr(PL* APL*)vov r , where vo is the minimizer of the 
RHS of ©. Combining the following two facts: tr(A ) -tr(P L * AP L . ) = and tr(P L * AP L . ) + 
tr(P L *x AP L *x) = tr(A) = 0, we obtain that 

tr(A + P L ,xAP L ,x) = 0. (53) 

Further application of (f5TT > implies that 

£ (Ao + P l ^AP l ,x,Qoxx t /||Q x||) f > 0. (54) 

We note that 

P L ,xAP L »x,Q xx T /||Q x||) = (p L .xAP L .xP L .x,Q xx T /||Qox|| / 

/ F \ IF 

P L ,x AP L ,x, Q xx r P L ,x/||Qox||) . (55) 

/ F 



Combining (1541) and (155 b we conclude that 

- E (Pl^AP l »x,Q xx t /||Qox||) f < £ (A ,Qoxx r P L *x/||Q x| 
= ^ tr(P L .AP L .)(vg , Qox/||Qox||)(vg , P L .xx) < | tr(P L . AP L *)| £ Iv^. (56) 

x6A" xeA'o 

We apply ([56]) and then use © with Q = P L * AP L * / tr(P L * AP L *) to obtain the inequality: 
||Pl*AP l *x||/v / 2+ Yl (Pl^AP l ,x,Q xx t /||Q x| 

xGA'i xeA" 
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> Y ||Pl*AP l ,x||/V2- |tr(P L .AP L .)| Y l v o x *l > °- ( 57 ) 

We define Hi = {Q G H : QP L »± = 0} and claim that Ql leads to the following inequality: 

Y ||Q(P L -x)|| > V2 Y IIQ( p L*x)|| VQ G Ml (58) 

xe^i xeA'o 

Indeed, since the RHS of (l58l ) is a convex function of Q, its maximum is achieved at the set of all 
extreme points of Hi, which is {Q G K DxD : Q = vv T , where v G L*,||v|| = 1}. Therefore the 
maximum of the RHS of d58l) is the RHS of ©. Since the minimum of the LHS of (I58T ) is also the 
LHS of ©, 458]) is proved. 

We also claim that Hi can be extended to all Q G M. DxD such that QP L *± = 0- Indeed, we 
assign for any such Q (Q G R DxD satisfying QP L ,i = 0) with SVD Q = USV T the matrix 
Q' = VSV T / tr(VEV T ) and note that Q' G Hi. It is not hard to note that the inequality in d58l) 
holds for Q if and only if it also holds for Q'. 

Now we apply Q with Q = P l ,iAPl« to obtain the other inequality: 

Y ||P L *xAP L ,x||/V2+ Y (P L ^AP L ,,Q xx T /||Q x||) F 

xSAi xSAb 

> Y l|P L ^AP L *x||/V2- Y I|Pl^AP l *x|| 

= Y I|Pl^AP l ,(P l ,x)||/V2 - Y I|Pl*-lAP l .(P l .x)|| > (59) 
xeAi xeA'o 

Finally, we combine (I57T ) and (l59l and conclude that the RHS of (l52l is nonnegative and conse- 
quently (|46T ) holds. 

7.2 Proof of Theorem E] 

Since F is convex, we only need to exclude the case of non-strict convexity, i.e., we need to exclude 
the case 

F(Qx) + F(Q 2 ) = 2 F((Qi + Q 2 )/2) for Ch + Q 2 . (60) 

If 460]) is true, then 

AT N N 

Y IIQlXiH + Y ll Q2X *ll = IKQl + Q2)xi||. (61) 

i=l i=l i=l 

Combining (loTT) with the fact that ||QiXj|| + ||Q2Xj|| > ||(Qi + Q2)xj||, we obtain that ||QiXj|| + 
||Q2Xj|| = || (Qi + Q2)xj|| for any 1 < i < N and therefore there exists a sequence {ci}f =l C M 
such that 

Q2Xj = or QiXj = Q Q2Xj for all 1 < i < N. (62) 

We conclude Theorem|2]by considering two different cases. We first assume that ker(Qi) = 
ker(Q2). We denote 

Ql = Pker(Qi) ± QlPkcr(Qi)- L and Q2 = Pker(Qi)-L Q2Pkcr(Qi)-L ■ 
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It follows from ([62]) that 

Ql( P kcr(Qi)^ x i) = °i Q2(Pkcr(Q 1 )^ x i) 

and consequently that Pker(Q 1 )- LX i nes m one °f tne eigenspaces of Q^f x Q2- We claim that Q^Q.2 
is a scalar matrix. Indeed, if on the contrary Q^Q.2 is not a scalar matrix, then {Pkcr(Qi) xX «}^=i 
lies in a union of several eigenspaces with dimensions summing to dim(P ker /Q 1 \± ) and this contra- 
dicts (TT0T > - In view of this property of Q^ X Q2 and the fact that tr(Qi) = tr(Qi) = 1 we have that 
Qi = Q2 and Qi = Q2, which contradicts our current assumption. 

Next, assume that ker(Qi) 7^ ker(Q2). We will first show that if 1 < i < N is arbitrar- 
ily fixed, then x; £ ker(Q2) U ker(P ker (Q 1 )Q2). Indeed, if x, ^ ker(Q2), then using (l62l) 
we have Qix^ = a Q.2Xj. This implies that CiP k er(Qi)Q2Xi = Pker(Qi)Qi x i = Q and thus 
Xj E ker(P kcr (Q 1 )Q2). That is, X is contained in the union of the 2 subspaces ker(Q2) and 
ker(Pj ;er (Q 1 )Q2). The dimensions of both spaces are less than D. This obvious for ker(Q2), since 
t r (Q2) = 1. For ker(P kcr (Q 1 )Q2) it follows from the fact that ker(Qi) 7^ ker(Q2) and thus 
Pker(Qi)Q2 7^ Q. We thus obtained a contradiction to (fTOl) . 



7.3 Verification of (fT3l) and (fT4l) as Sufficient Conditions and (031 ) and (fT6l) as Necessary Ones 

We revisit the proof of Theorem Q] and first show that (PT31 and (fT4l can replace © and ([7]) in the 
first part of Theorem [TJ Following the discussion in £12.61 © guarantees that (fT3l) and (fT4l are well 
defined and thus we only deal with the first part of Theorem Q] 

To show that (fl4l can replace ©, we prove the inequality in (l59l) using (fT4l as follows. Assum- 
ing that the SVD of P L ,iAP L . is USV T , then Q' := V£V T /tr(5]) satisfies Q' G H, Q'P L ,x = 



and ||Q'x| 
obtain that 



|P L ,xAP L *x||/tr(£) 



.AP L .x||/||P L .xAP r 



Using this fact, we 



||P L *xAP L ,x|| > ||P L ,xAP L *| 



mm 



Q'SH,Q'P L »x=Q x6;ti 



(63) 



We also note that 



]T(p L ,xAP L .,Qoxx r /||Q x||) = Wp l ,xAP l ,,Q xx t P l ,/||Qo x 



> 



AP T 



^Q xx t P l V||Q x| 



xeA"o 



(64) 



Therefore d59l ) follows from (fT4l . d63l ) and d64"l ). Similarly, one can show that ( [TBI may replace ©. 

One can also verify that (03]) and (fT6l) are necessary conditions for exact recovery by revisiting 
the proof of Theorem Q] and reversing inequalities. 



7.4 Proof of Lemma H 

We first note by symmetry that the minimizer of the LHS of (fT8l) for the needle-haystack model is 
Q = Pl*/(I We can thus rewrite (fT8T ) in this case as aiMn/d > 2\^2aoKro/{D — d), where 
the "radii" n and r$ are the norms of the normal distributions with covariances a\d~ 1 ~Pi l * and 
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GqD 1 Pl»± respectively. Let fi and V2 be the ^-distributed random variables with d and D — d 
degrees of freedoms, then (fT8l) obtains the form 

ai(Tl TU ~ ^ 2V2a (J0 rr, ~ 

■xLri> = Jk.ro. 



dVd 



(D - d)VD 



Applying (B.7) of Lerman etaD d2012h . Efi > and Ef < V£> - 3. Therefore (fl8T> 

follows from (fTTl) . 



7.5 Proof of Theorem II 

For simplicity of the proof we first assume that the supports of \xq and \i\ are contained in a ball 
centered at the origin of radius M. 

We start with the proof of Q "in expe ctation" and then extend it to hold with high probabil- 
ity. Following ICoudron and Lermanl (120120 . we define the following expected version of F and its 
oracle minimizer (where the subscript / indicates integral): 



|Qx|| dfj,(x) 



and 



Qi = argmin Fj(Q). 

QeH,QP L «=Q 



(65) 



(66) 



The spherical symmetry of ^ ,L* X implies that Qj = ^P L ,iP^ ti is the unique minimizer 
of (l66l). To see this forma lly, we first note that /i L *± satisfies the two-subspaces criterion of 



Coudron and Lermanl (120121) for any < 7 < 1 (this criterion general i zes (fTOl ) of this paper to 
continuous measures) and thus by Theorem 2.1 of ICoudron and Lermanl (|2012h (whose proof fol- 
lows directly the one of Theorem 2 here) the solution of this minimization must be unique. On the 
other hand, any application of an arbitrary rotation of L* (within R D ) to the minimizer expressed 
in the RHS of should also be a minimizer of the RHS of We note that -p^P L „xPj^ ± 
is the only element in the domain of this minimization that is preserved under any rotation of L* . 
Therefore, due to uniqueness, this can be the only solution of this minimization problem. 
Let 

i 2 = {Q6H: QP L * =0, Q h Q and cond(P L ,±QP L ,±) > 2}, 

where Q >z Q denotes the positive semidefiniteness of Q and cond(P L «±QP L „±) denotes the 
condition number of this matrix, that is, the ratio between the largest and lowest eigenvalues of 
P L *±QP L »±, or equivalently, the ratio between the top eigenvalue and the (D — d)th eigenvalue of 
Q. Since Qj is the unique minimizer of d66l ) and Qj ^ H 2 , then 



ci := min(F / (Q)-F / (Q/))>0. 

QeH 2 



(67) 



We note that if x is a random variable sampled from \i and Q G H (so that ||Q|| < ||Q||* = 1), then 
||Qx|| < M. Applying this fact, (l67l ) and Hoeffding's inequality, we conclude that for any fixed 

Qei 2 



F(Q) - F(Q/) > ciA^/2 w.p. 1 - exp(-cfiV/2M 2 



(68) 
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We also observe that 



F(Qi) - F(Q 2 ) < ||Qi - Q 2 || INI < IIQi " Q2WNM. 



(69) 



8=1 



Combining (l68l) and d69l , we obtain that for all Q in a ball of radius r\ := c\/2M centered around 
a fixed element in M 2 : F(Q) - F(Q r ) > w.p. 1 - exp(-c?iV/2M 2 ). 

We thus cover the compact space H2 by an ri-net. Denoting the corresponding covering number 
by N(M.2, ri) and using the above observation we note that w.p. 1 — N(M.2,ri) exp(— cf N/2M 2 ) 



F(Q) - F(Qj) > for all Q G H 2 . 



(70) 



The definition of Qo (that is, ([8])) implies that F(Qo) < F(Qj). Combining this observation 
with (f70l . we conclude that w.h.p. Qo ^Efe. We also claim that Qo t: (see e.g., the proof of 
Lemma [T6l which appears later). Since Qo ^^-2 and Qo h Q, Qo satisfies the following property 
w.h.p.: 

cond(P^ x QoP^x) < 2. (71) 

Consequently, © holds w.h.p. (more precisely, w.p. 1 - N(W 2 , c x /2M) exp(cf N/2M 2 )). 

Next, we verify (fT3l) w.h.p. as follows. Since Qo is symmetric and QqPl* = Q (see ((8])), then 



Qo = P L *xQoP L - 



(72) 



Applying (T72T >. basic inequalities of operators' norms and (T7TT >. we bound the RHS of (fl"3l from 
above as follows: 



V2 



<V2- 



Qoxx T P L »x/||Q x| 



V2 



P L ,xQ P L ,x • ^ P L ^xx T P L «x/||Q x| 

x&X 



P L ,xQ P L »- 



P L ,xxx T P L «x/||Q x| 

xe^o 



<V^-A max (P L .±Q P L .±)- || Yl PL^xx T P L ,x/||A min (P L «xQ P L »x)P L ,xx|||| 

<M\ E Pl^xx t P l ,x/||P l ,xx|||| = max V8u T ( ^ P L ,xxx T P L ,x/||P L *xx||)u. 

Therefore to prove ( fT3T ), we only need to prove that with high probability 



min V ||Qx|| > max V8u T ( V P L «xxx T P L ,x/||P L ,xx||)u. (73) 
QeH,QP L , ± =0,^ 



xex 



We will prove that the LHS and RHS of (1731 concentrates w.h.p. around the LHS and RHS of 
( fT8T ) respectively and conse quently verify d73l) w.h.p. Le t ei be the difference between the RHS and 
LHS of (|73l . Theorem 1 of ICoudron and Lermanl d2012l) implies that the LHS of £73) is within dis- 
tance ei/4 to the RHS of (fT8T ) with probability 1 — C exp(— Nf C) (where C is a constant depending 
on ei, fi and its parameters). 
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The concentration of the RHS of (fl8T ) can be concluded as follows. The spherical symmetry 
of /U 0L *x implies that the expectation (w.r.t. /io) of Ylxex PL* xXxT f > L* ± /ll-E , L* xX ll * s a scalar 
matrix within L*- 1 , that is, it equals p M P L ,xxx T P L „x/||P L .xx|| for some p M G R. We observe 
that 

E w tr(P L ,xxx T P L ,x/||P L ,xx||) = E M ||P L ,xx|| (74) 
and thus conclude that = E M0 ||P L »xx||/(/J - d). Therefore, for any u G S' 13-1 n L*- 1 

E„ u T (P L »xxx T P L ,x/||P L ,xx||)u = E„ ||P L .xx||/(Z> - d) = f ||P L .xx|| d/i (x)/(£> - d). 



(75) 

We thus conclude from (l75l) and Hoeffding's inequality that for any fixed u G S^" 1 flL* 1 
the function V8u T (J2 xe x P L* xXxTp L* x /ll P L* xX ll) u is within distance ei/4 to the RHS of (fl8T ) 
with probability 1 — C7exp(— N/C) (where C is a constant depending on ei, \i and its parame- 
ters). Furthermore, applying e-nets and covering (i.e., union bounds) arguments with regards to 
s d-i n L *± ? we obtain that for all u e 5D-1 n L *± ? ^8u T (^ xg ^ P L ,xxx T P L *x/||P 



u 



is within distance e±/2 to the RHS of (fl8T ) with probability 1— Cexp(— N/C) (where C is a constant 
depending on e±, \i and its parameters). In particular, the RHS of (|73T ) is within distance ei/2 to the 
RHS of £l8]> with the same probability. We thus conclude |73]) with probability 1-C exp(-N/C). 

Similarly we can also prove (fl4l) . noting that the expectation (w.r.t. po) of Qoxx t Pl* /||Qo x II 
is 0, since Qo x /||Qo x || and x t Pl* are independent when x is restricted to lie in the complement 
of L* (that is, x G X ). 

If we remove the assumption of bounded supports (with radius M), then we need to replace 
Hoeffd ing's inequality with th e Hoeffding-type inequality for sub-Gaussian measures of Proposition 



5. 10 of lVershyninl (Ito appearh . where in this proposition ctj = 1 for all 1 < i < n. 

We emphasize that our probabilistic estimates are rather loose and can be interpreted as near- 
asymptotic; we thus did not fully specify their constants. We clarify this point for the probability 
estimate we have for ©, that is, 1 - N(M 2 , c\/2M) exp(cf N/2M 2 ). Its constant iV(H 2 , r\) can 
be bounded from above by the covering number iV(IHIo, r±) of the larger set Mq = {Q G TC 



Q?:,?:l < 1 }. which is bounded from above by (8/ri) D( - D (see e.g., Lemma 5.2 of (I Vershynin . 



to appean ')'). This is clearly a very loose estimate that cannot reveal interesting information, such as, 



the right dependence of N on D and d in order to obtain a sufficiently small probability. 

At last, we explain why (fTOl) holds with probability 1 if there are at least 2D — 1 outliers. 
We denote the set of outliers by {yi, V2, ■ ■ ■ ,yAr }> where > 2D — 1, and assume on the 
contrary that (fTOl) holds with probability smaller than 1. Then, there exists a sequence C 
{1, 2, 3, • • • , A^o} such that the subspace spanned by the D — 1 points , yj 2 , ■ ■ ■ , Y%d-i contains 
another outlier with positive probability. However, this is not true for haystack model and thus our 
claim is proved. 

7.6 Proof of Theorem H 



(A. 15) of iLerman et all (|2012h as follows 



This proo f follows ideas of ILerman et al.l (120121) . We bound from below the LHS of |7]) by applying 



min IIQ X II ^ ~~j= mm l vTx l- (76) 
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We denote the number of inliers sampled from yt,\ by N± and the number of outliers sampled from 
Mn by Nq(= N — N -j ). We bound from below w.h.p. the RHS of (l76l) by applying Lemma B.2 of 



Lerman et al. ( 2012h in the following way: 



^ l vTx l > ^[VV^Ni - 2^/W[d - ty/Wij w.p. 1 - e"' 2/2 . (77) 



mm 



By following the proof of Lemma B.2 of iLerman et all (120121) we bound from above w.h.p. the RHS 
of © as follows 

max V |v T x| < -^=(J2/^N + 2jN^d + ty^I%) w.p. 1 - e^* 2/2 . (78) 
-Mlv||=i vD v / 



v6 lm|v||=1 x ^ VD 

We need to show w.h.p. that the RHS of <|78]> is strictly less than the RHS of d77l ). We note that 
Hoeffding's inequality implies that 

iVi > aiJV/2 w.p. 1 - e ~ a27V/2 and \N - a N\ < a N/2 w.p. 1 - 2e~ a o N/2 . (79) 

Furthermore, d2"0l and d79l imply that 

d < ATi/4 w.p. 1 - e ~ a ^ N/2 and d < iV /4 w.p. 1 - e~ a o N/2 . (80) 



Substituting i = (> ^"1^/20 w.p. 1 - e - a \ N l 2 ) in O and t = y/No/10 (> 

VooiV /20 w.p. 1 - 2e~ a o N / 2 ) in CZSJ and combining (QUl and (|7gll-(l80l. we obtain that © holds 
w.p. 1 - e ~°^ N l 2 - 2e~ a ° N / 2 - e-^/soo _ e -a 7V/800_ We can similarly obtain that © holds 

with the same probability. 

7.7 Proof of Theorem |6] 

We first establish the following two lemmata. 

Lemma 15 If X satisfies (1101) . then for any eq > diere ex/st a positive constant 71 = eo) 
smc/i that for any symmetric matrix A w/f/i tr(A) = and || A||p < eo.' 

F(Q + A)-F(Q) > 7l (A', e o)||A|| 2 ,. (81) 

Proof We first prove by contradiction that the second directional derivative of F(Q) at any Q £ i 
and any direction A is positive, that is, 

t^o t z 

Assume that there exists a direction Ao with tr(Ao) = 0, ||Ao||f = 1 ( or ll^olb = 1) and 
F± (Q) = 0. Since ||Qx.j|| is a convex function w.r.t. Q, we obtain that 



d 2 „A 

-^2 HQxj + tA Xi 



= for any 1 < i < N. (83) 

t=o 



Combining (1831 ) with the following observation 



d 2 |, u , t ,,_ ||v|| 2 ||P v x(u + tv)f 
dt 2 ||u + tv|| 3 
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we have that for any fixed 1 < i < N: P(A oX j) ± Q Xi = ^ anc ^ consequently Qx^ = qAoXj for 
some Cj € M. We obtain the desired contradiction by (flOl) (following the proof of Theorem [2]). 
Finally, we define the sets 

y(e ) :={QeH: ||Q - Q|| F < e } x {A : tr(A) = and || A|| F = 1}, (84) 

y(e ) : ={Qei: ||Q - Q\\ F < e } x {A : tr(A) = and || A|| 2 = 1} (85) 
and the numbers 

1i(X,eo):=l min F£(Q) and ^(X, e ) := \ min F£(Q). (86) 

2 (Q,Ao)e^(eo) 2 (Q,A )6y(eo) 

It follows from the compactness of y(eo) and the positiveness and continuity of the second direc- 
tional derivatives that ji(X, eo) and j{(X, eo) > are positive. The definitions and positivity of 
ji(X, eo) and j[(X, cq) and the continuity of second derivatives clearly implies d8T| ). ■ 



Lemma 16 The minimizer of F(Q), Q, is a semi-definite positive matrix. 

Proof We assume that Q has some negative eigenvalues and show that this assumption contradicts 
the defining property of Q, i.e., being the minimizer of .F(Q). We denote the eigenvalue decompo- 
sition of Q by Q = V^S^VT and define £+ = max(£!Q, 0) and Q+ = V^St VT/ tr(St ) <E 

M. Then tr(Ef ) > tr(£ 6 ) = tr(Q) = 1 and for any x G R D we have 

||Q+x|| < tr(E+)||Q+x|| = ||St(vT x )|| < ||Eq(vTx)|| = ||Qx||. 
Summing it over all x S X, we conclude the contradiction F(Q + ) < i^Q). 



In order to prove Theorem [6] we first bound ||Q — Q\\f by 2 and then apply Lemma [TBI to 
bound it by the desired estimate. In view of Lemma [TBI the singular values of Q coincide with its 
eigenvalues and thus 



IQIIf 



D 



D 



, ]T^(Q)< J> 4 (Q)=tr(Q) = l. 
\ i=i 



(87) 



i=i 



Similarly, ||Q|| F = 1 and consequently 

HQ - Q\\f < \\Q\\ F + \\Q\\f < 2. 

Next, we observe that 

N N N 



(88) 



N 



\F x (Q)-Fx(Q)\<J2 iQxill-HQxil K^llQCxi-XiJU^J^IIxi-Xill^^Ci (89) 



i=i 



i=l 



i=l 
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and similarly \F X (Q) - F^(Q)| < YlZi e i- Therefore, 

FAQ) - FAQ) = (FAQ) - F AQ)) + ( F AQ) - F AQ)) + (FAQ) 

N 

-F x (Q))<0+\F x (Q)-FAQ)\ + \FAQ)-F X (Q)\<2j2e l . (90) 



i=l 



We define 70 = 71 (X, 2) and note that equation (|2TT > follows from ( f8TT > (with en = 2, and A = 
Q — QX d90b and (l88l) . Applying the Davis-Kahan perturbation Theorem (IDavis and Kahanl . ll970f) 
to (f2TT >. we conclude 



7.8 Proof of Proposition S 

We recall the set y(2), which was defined in d84l ) with eo = 2, and the function Fj, which was 
defined in d65l ). The notation F/^(Q) should be clear from d82~l ), where now i 7 / replaces F. 
We define the set 

Z = {Q e H : IIQUjt < 3} x {A : tr(A) = and ||A|| F = 1}. (91) 

and the constant 

co(/u) := min -F/a(Q) 

(Q,A)6y 2 

and note that the proof of Lemma [3] and the compactness of Z imply that co(//) > 0. We make 
two observations that conclude the proof. First of all, Z contains 3^(2) and consequently 70 > cq. 
Second of all, the law of large numbe rs implies that F'^(Q)/N — > Fj'^(Q) almost surely for any 
A and Q (see also related bounds in ICoudron and Lermanl (120121) ). By combining the above two 
facts, we immediately conclude (124)) for 70 and c$; the proof is identical for j' and c' (one only 
needs to replace || A|| F = 1 with || A||2 = 1 in d9~TT ). 

7.9 Proof of Theorem 1 

The theorem follows from the observation that < F(Q) - F S (Q) < N5/2 for all Q G H and the 
proof of Theorem [6l 

7.10 Proof of Theorem|9] 

It is sufficient to verify that 

If A g M. DxD with Im(A) = L*, then L(A + 77I) -»• 00 as 77 -> 0. (92) 

Indeed, since L(A) is a continuous function, (l92l) implies that L(A) is undefined (or infinite) and 
therefore A is not the minimizer of d29l as stated in Theorem |9] 

We fix a\ < lim^^oo xu(x) and note that Condition Do (w.r.t. L*) implies that 

\X \/N> (D-d)/ ai . (93) 

Condition M implies that there exists x\ such that for any x > x\\ xu(x) > a\ and therefore 
(recalling that u = p') p(x) > a\ ln(x — x\)/2 + u{x\)/2. Thus for any Xj £ Xq, we have 

p(xj (A + r?I) -1 Xi) > m ln(l/r/ - xi)/2 + Q for some constant d = Cj(xj, A) (94) 
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and 

N 

— log |A| < NC + {D- d)/21n(??) for some C = C (A). (95) 
Equation d92l ) thus follows from d93l)-(l95l) and the theorem is concluded. 

7.11 Proof of Theorem [H 

The derivative of the energy function in the RHS of (l32l is QX T X + X T XQ. Using the argument 
establishing (l38l) and the fact that Q2 is the minimizer of (l32l ). we conclude that QX T X + X T XQ 
is a scalar matrix. We then conclude d33l ) by using the argument establishing d39l as well as the 
following two facts: tr(Q2) = 1 and X is full rank (so the inverse of X T X exists). 



7.12 Proof of Theorem [12] 

We frequently use here some of the notation introduced in ^4.11 in particular, I(Q), L(Q) and 
T(Q). We will first prove that F(Qk) > F{Qk+i) f° r an fc > 1. For this purpose, we use the 
convex quadratic function: 



1 N 

G(Q,Q*) = - (IIQxif/llQ^II + HQ*^ 



i=l 



Following the same derivation of (|47T ) and (I38T ). we obtain that 



Q=Q fe+ i 



Q 



k+l 



V 



N 

E 

i = l 



T 

|QfcXj| 







N 


+ 




E 






^(Q fe 



x,-x- 



Qfc 



/ 



We let A fc = E 4 =i, ^/( Qfc ) c fc = P L(Qfe) ± A fc ^q^x and for any symmetric A € 

with tr(A) = and P L ( Qfc ) A = we let A = AP L(Qfc) ±. We note that 



>DxD 



tr(Ao) = (A ,I) F = (P^ (Qfc) xAP L(Qfe) x,I 



A 'PL(Q fe ) ± ^L(Q fc ) ± / F 



= < A, I - P L{Qfc) > F = (A, I) F - (A, P L(Qfc) > F = (A, I) F = tr(A) = 0. 

Consequently, we establish that the derivative of G(Q, Qfc) at Qfc + i in the direction A is zero as 
follows. 

((Qfc+i Afc + A fc Q fc+1 )/2, A) F = (Qfc+iAfc, A) F = c fc (p L(Qfc) ± A^P^^x A fc , A 



=c fe (P 



UQ k ) ±A k P L(Q fc )- LA fc'- t 'L(Q fe )-L^0i' L( Q fe) 



Afc,P T 



AnP 



T 



Cfc((P 



A^P 



L(Q fc ) ±A fc " t 'L(Q fc ) ± )(PL(Q fc ) ±Afcr 'L(Q 



A,P T 



A ) = Cfc (I, A ) F = . 

F 



This and the strict convexity of G(Q, Qfc) (which follows from Sp({xi}j^/(Q fe )) = M. D using (fTOl l) 
imply that Qfc + i is the unique minimizer of G(Q, Qfc) among all Q £ H such that P L (Q fc )Q = 0. 
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Combining this with the following two facts: Qfc +1 Xj = for any i G I(Qk) and G(Q k , Qfc) = 
F(Qk), we conclude that 

Trtr\ \ no n IIQfc+i x i|l IIQfc x «ll 

nQfc+i) = IIQfc+i x ill = z> HQ || 

< £ " Qfc+1 9Nn + "i^" 2 = G(Qfc+i»Qfc) < g(Qfe,Q*) = f (Q*)- (96) 

Since F is positive, F(Qfc) converges and 

F(Q fc ) - F(Q fe+ i) -> as fc -> oo. (97) 
Applying (|96l ) we also have that 

(98) 



F(Q k )-F(Q k+1 )>G(Q k ,Qk)-G(Q k+1 ,Q k ) = l ]T ||(Q fc - Q fc+1 )xi|| 3 /||Q*x 



2 

We note that if Qfc / Q fc+1 , then Sp({xi}^ /{Qfc) ) =R D D ker(Qfc-Qfc +1 ) and l/||Q fc Xi|| > 
1/ maxj ||xj||. Combining this observation with ([971 and d98l ) we obtain that 

HQfc - Qfe+l||2 ->• as/c^oo. (99) 

Since for all G N, Qfc is nonnegative (this follows from its defining formula (l42l) ) and 
tr(Qfc) = 1, the sequence {QfcjfcgN lies in a compact space (of nonnegative matrices) and it thus 
has a converging subsequence. Assume a subsequence of {Qfcj-fceN, which converges to Q. We 
claim the following property of Q: 

Q = argminF(Q), where i :={Qei: ker Q D L(Q)}. (100) 

QeH 

In order to prove (1 1001) . we note that (l96l) and the convergence of the subsequence imply that 
F(Q) = F(T(Q)). Combining this with d96j) (though replacing Q fc and Q fc+1 in d%]) with Q and 
T(Q) respectively) we get that G(T(Q), Q) = G(Q, Q). We conclude that T(Q) = Q from this 
observation and the following three facts: 1) Q = Q is the unique minimizer of G(Q, Q) among 
all QsH,2) p L (Q)Q = Q, 3) Q = T(Q) is the unique minimizer of G(Q, Q) among all Q G U 
such that P L( -q^Q = (we remark that F(Q) is strictly convex in H and consequently also in Ho 
by Theorem|2]). Therefore, for any symmetric A G R DxD with tr(A) = and P L ^A = 0, the 
directional derivative at Q is 0: 

d ~vi \ / . ~ \— \ x,x!r 

! E 



We note that <1 1 1 b is the corresponding directional derivative of F(Q) when restricted to Q G Ho 
and we thus conclude (1 100b . 

Next, we will prove that {QfcjfceN converge to Q by proving that there are only finite choices for 
Q. In view of (1 1001 ) and the strict convexity of F(Q) in Hq, any limit Q (of a subsequence as above) 
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is uniquely determined by I(Q). Since the number of choices for I(Q) is finite (independently of 
Q), the number of choices for Q is finite. That is, y := {Q G M : F(Q) = F(T(Q))} is a 
finite set. Comb ining this with (l99l) and the convergence analysis of the sequence {QfcjfceN (see 
Ostrowski . 19661 . Theorem 28.1), we conclude that {QfcjfceN converges to Q. 



At last, we assume that Qxj ^ for all 1 < i < N. We note that I(Q) = and thus Q = Q 
by (1 1001) . The proof for the rate of convergence follows the analysis of generalized Weiszfeld's 
method by lChan and Mulet dl999h (in particular see §6 of that work). We practically need to verify 



Hypotheses 4. 1 and 4.2 (see §4 of that work) and replace the functions F and G in that work by 
F(Q) and 

N 

G(Q, Q*) = £ (HQxillVllCTxill + HQ^II) (102) 

i=l 

respectively. We note that the functions G and G (defined earlier in this work) coincide in the 
following way: G(Q, Qk) = G(Q, Qk) for any k G N (this follows from the fact that QfcXj / 
for all k G N and 1 < i < N; indeed, otherwise for some i, Q 7 X; = for j > k by (l42l) and this 
leads to the contradiction Qx^ = 0). We remark that even though IChan and Mulet dl999h consider 



vector-valued functions, its proof generalizes to matrix- valued functions as here. Furt hermore, we 
can replace the global properties of Hypotheses 4. 1 and 4.2 of IChan and Mulet dl999h by the local 



properties in B(Q, 5q) for any <5o > 0, since the convergence of implies the existence of K$ > 
such that Q/t G B(Q, 5q) for all k > K$. In particular, there is no need to check condition 2 in 
Hypothesis 4.1. Condition 1 in Hypothesis 4.1 holds since -F(Q) is twice differentiable in B(Q, <5o) 
(which follows from the assumption on the limit Q = Q and the continuity of the derivative). 
Conditions 1-3 in Hypothesis 4.2 are verified by the fact that C of Hypothesis 4.2 satisfies C(Q*) = 
Eiii x i x f/HQ* x ill and Q* x i / Q when Q* G B(Q,S ). Condition 3 in Hypothesis 3.1 and 
condition 4 in Hypothesis 4.2 are easy to check. 

7.13 Proof of Theorem [H 

The proof follows from the second part of the proof of TheoremfT2l while using instead of G*(Q, Q*) 
the function 



G S (Q,Q*) = \ E (IIQ x ,H 2 /IIQ* x l |l + IIQ* x t |l)+ E (IIQ X ,I| 2 A* + <V2). 

j=l,|SQ*x 4 ||><5 i=l,||Q* Xi ||<<5 



7.14 Proof of Theorem [H 

We note that the minimization of F(Q) over all Q G M such that QPfx = in Algorithm[3]can be 
performed at each iteration with respect to the projected data: {X) = {P L X! , P^X2 , • • • , F^xat}- 
We note that conditions (O and (|7]) hold for P^(Af) with any L 2 L*. Therefore, Theorem [T] 
implies that ulL* and L D L* in each iteration. Since dim(L) decreases by one in each iteration, 
dim(L) = d in D — d iterations and thus L = L*. 

8. Conclusion 

We proposed an M-estimator for the problems of exact and near subspace recovery. Substantial the- 
ory has been developed to quantify the recovery obtained by this estimator as well as its numerical 
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approximation. Numerical experiments demonstrated state-of-the-art speed and accuracy for our 
corresponding implementation on both synthetic and real data sets. 

This work br oadens the perspecti ve o f two recent gro und-breaking theoretical works for sub- 
space recovery by lCandes et all (|201 lh and lXu et al.l (|2012r) . We hope that it will motivate additional 
approaches to this problem. 

There are many interesting open problems that stem from our work. We believe that by mod- 
ifying or extending the framework described in here, one can even yield better results in various 
scenarios. For example, iLerman et al.l (|2012l) modified (@]) by using instead of H the tightest possi- 
ble convex relaxation of orthogonal projectors. Their algorithm is often very effective for subspace 
recovery when d is known. It will also be interesting to adapt the current framework so that it can 



detect linear structure in the presence of both sparse elementwise corruption (as in ICandes et al. 
(120111)') and the type of outlier s addressed in here. Another direction was recently followed up 
by lCoudron and Lermanl (120121) . where they established exact asymptotic subspace recovery under 
specific sampling assumptions, which may allow relatively large magnitude of noise. It is interest- 
ing to follow this direction and establish exact recovery when using in theory a sequence of IRLS 
regularization parameters {<5i}igN approaching zero (in analogy to the work of iDaubechies et al. 
d2010h \ 
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